diff --git a/CITATION.cff b/CITATION.cff
new file mode 100644
index 0000000000000000000000000000000000000000..81ea8f792645b1904e792918590eb215c62dd323
--- /dev/null
+++ b/CITATION.cff
@@ -0,0 +1,9 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+title: "OpenMMLab's Pre-training Toolbox and Benchmark"
+authors:
+  - name: "MMPreTrain Contributors"
+version: 0.15.0
+date-released: 2023-04-06
+repository-code: "https://github.com/open-mmlab/mmpretrain"
+license: Apache-2.0
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 0000000000000000000000000000000000000000..ce84c2a09f59785d3220a722b8ba1282c97a8030
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,73 @@
+# Contributing to MMPreTrain
+
+- [Contributing to MMPreTrain](#contributing-to-mmpretrain)
+  - [Workflow](#workflow)
+  - [Code style](#code-style)
+    - [Python](#python)
+    - [C++ and CUDA](#c-and-cuda)
+  - [Pre-commit Hook](#pre-commit-hook)
+
+Thanks for your interest in contributing to MMPreTrain! All kinds of contributions are welcome, including but not limited to the following.
+
+- Fix typos or bugs
+- Add documentation or translate the documentation into other languages
+- Add new features and components
+
+## Workflow
+
+We recommend that potential contributors follow this workflow for contribution.
+
+1. Fork and pull the latest MMPreTrain repository, and follow [get started](https://mmpretrain.readthedocs.io/en/latest/get_started.html) to set up the environment.
+2. Checkout a new branch (**do not use the master or dev branch** for PRs)
+
+```bash
+git checkout -b xxxx # xxxx is the name of new branch
+```
+
+3. Edit the related files following the code style mentioned below
+4. Use the [pre-commit hook](https://pre-commit.com/) to check and format your changes.
+5. Commit your changes
+6. Create a PR with related information
+
+## Code style
+
+### Python
+
+We adopt [PEP8](https://www.python.org/dev/peps/pep-0008/) as the preferred code style.
+
+We use the following tools for linting and formatting:
+
+- [flake8](https://github.com/PyCQA/flake8): A wrapper around some linter tools.
+- [isort](https://github.com/timothycrosley/isort): A Python utility to sort imports.
+- [yapf](https://github.com/google/yapf): A formatter for Python files.
+- [codespell](https://github.com/codespell-project/codespell): A Python utility to fix common misspellings in text files.
+- [mdformat](https://github.com/executablebooks/mdformat): Mdformat is an opinionated Markdown formatter that can be used to enforce a consistent style in Markdown files.
+- [docformatter](https://github.com/myint/docformatter): A formatter to format docstrings.
+
+Style configurations of yapf and isort can be found in [setup.cfg](https://github.com/open-mmlab/mmpretrain/blob/main/setup.cfg).
+
+### C++ and CUDA
+
+We follow the [Google C++ Style Guide](https://google.github.io/styleguide/cppguide.html).
+
+## Pre-commit Hook
+
+We use a [pre-commit hook](https://pre-commit.com/) that checks and formats `flake8`, `yapf`, `isort`, `trailing whitespaces` and `markdown files`,
+fixes `end-of-files`, `double-quoted-strings`, `python-encoding-pragma` and `mixed-line-ending`, and sorts `requirements.txt` automatically on every commit.
+The config for the pre-commit hook is stored in [.pre-commit-config](https://github.com/open-mmlab/mmpretrain/blob/main/.pre-commit-config.yaml).
+
+After you clone the repository, you will need to install and initialize the pre-commit hook.
+ +```shell +pip install -U pre-commit +``` + +From the repository folder + +```shell +pre-commit install +``` + +After this on every commit check code linters and formatter will be enforced. + +> Before you create a PR, make sure that your code lints and is formatted by yapf. diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..ae87343779455c4c4b43e10a27d1657142666726 --- /dev/null +++ b/LICENSE @@ -0,0 +1,203 @@ +Copyright (c) OpenMMLab. All rights reserved + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. 
For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. 
The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
+ + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright 2020 MMPreTrain Authors. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/MANIFEST.in b/MANIFEST.in new file mode 100644 index 0000000000000000000000000000000000000000..ad4d8dafbdeb31327429c94430a8338e5f024acb --- /dev/null +++ b/MANIFEST.in @@ -0,0 +1,5 @@ +include requirements/*.txt +include mmpretrain/.mim/model-index.yml +include mmpretrain/.mim/dataset-index.yml +recursive-include mmpretrain/.mim/configs *.py *.yml +recursive-include mmpretrain/.mim/tools *.py *.sh diff --git a/README.md b/README.md index a67d4b0aa6efa336d0b9be3f0244c77fc3f00c98..5318df5b958b8f54dcba1896776eebfb04ba9871 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,339 @@ -# mmpretrain +
+ +
 
+
+ OpenMMLab website + + + HOT + + +      + OpenMMLab platform + + + TRY IT OUT + + +
+
 
+ +[![PyPI](https://img.shields.io/pypi/v/mmpretrain)](https://pypi.org/project/mmpretrain) +[![Docs](https://img.shields.io/badge/docs-latest-blue)](https://mmpretrain.readthedocs.io/en/latest/) +[![Build Status](https://github.com/open-mmlab/mmpretrain/workflows/build/badge.svg)](https://github.com/open-mmlab/mmpretrain/actions) +[![codecov](https://codecov.io/gh/open-mmlab/mmpretrain/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/mmpretrain) +[![license](https://img.shields.io/github/license/open-mmlab/mmpretrain.svg)](https://github.com/open-mmlab/mmpretrain/blob/main/LICENSE) +[![open issues](https://isitmaintained.com/badge/open/open-mmlab/mmpretrain.svg)](https://github.com/open-mmlab/mmpretrain/issues) +[![issue resolution](https://isitmaintained.com/badge/resolution/open-mmlab/mmpretrain.svg)](https://github.com/open-mmlab/mmpretrain/issues) + +[📘 Documentation](https://mmpretrain.readthedocs.io/en/latest/) | +[🛠️ Installation](https://mmpretrain.readthedocs.io/en/latest/get_started.html#installation) | +[👀 Model Zoo](https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html) | +[🆕 Update News](https://mmpretrain.readthedocs.io/en/latest/notes/changelog.html) | +[🤔 Reporting Issues](https://github.com/open-mmlab/mmpretrain/issues/new/choose) + + + +English | [简体中文](/README_zh-CN.md) + +
+ + + +
+ + + + + + + + + + + + + + + + + +
+
+## Introduction
+
+MMPreTrain is an open source pre-training toolbox based on PyTorch. It is a part of the [OpenMMLab](https://openmmlab.com/) project.
+
+The `main` branch works with **PyTorch 1.8+**.
+
+### Major features
+
+- Various backbones and pretrained models
+- Rich training strategies (supervised learning, self-supervised learning, multi-modality learning, etc.)
+- Bag of training tricks
+- Large-scale training configs
+- High efficiency and extensibility
+- Powerful toolkits for model analysis and experiments
+- Various out-of-the-box inference tasks:
+  - Image Classification
+  - Image Caption
+  - Visual Question Answering
+  - Visual Grounding
+  - Retrieval (Image-To-Image, Text-To-Image, Image-To-Text)
+
+https://github.com/open-mmlab/mmpretrain/assets/26739999/e4dcd3a2-f895-4d1b-a351-fbc74a04e904
+
+## What's new
+
+🌟 v1.2.0 was released in 04/01/2024
+
+- Support LLaVA 1.5.
+- Implement RAM with a Gradio interface.
+
+🌟 v1.1.0 was released in 12/10/2023
+
+- Support Mini-GPT4 training and provide a Chinese model (based on Baichuan-7B).
+- Support zero-shot classification based on CLIP.
+
+🌟 v1.0.0 was released in 04/07/2023
+
+- Support inference of more **multi-modal** algorithms, such as [**LLaVA**](./configs/llava/), [**MiniGPT-4**](./configs/minigpt4), [**Otter**](./configs/otter/), etc.
+- Support around **10 multi-modal** datasets!
+- Add the [**iTPN**](./configs/itpn/) and [**SparK**](./configs/spark/) self-supervised learning algorithms.
+- Provide examples of [New Config](./mmpretrain/configs/) and [DeepSpeed/FSDP with FlexibleRunner](./configs/mae/benchmarks/). Here are the documentation links of [New Config](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#a-pure-python-style-configuration-file-beta) and [DeepSpeed/FSDP with FlexibleRunner](https://mmengine.readthedocs.io/en/latest/api/generated/mmengine.runner.FlexibleRunner.html#mmengine.runner.FlexibleRunner).
+
+🌟 Upgrade from MMClassification to MMPreTrain
+
+- Integrated self-supervised learning algorithms from **MMSelfSup**, such as **MAE**, **BEiT**, etc.
+- Support **RIFormer**, a simple but effective vision backbone that removes the token mixer.
+- Refactor dataset pipeline visualization.
+- Support **LeViT**, **XCiT**, **ViG**, **ConvNeXt-V2**, **EVA**, **RevViT**, **EfficientNetV2**, **CLIP**, **TinyViT** and **MixMIM** backbones.
+
+This release introduced a brand-new and flexible training & test engine, which is still in progress. You are welcome
+to try it according to [the documentation](https://mmpretrain.readthedocs.io/en/latest/).
+
+There are also some BC-breaking changes. Please check [the migration tutorial](https://mmpretrain.readthedocs.io/en/latest/migration.html).
+
+Please refer to the [changelog](https://mmpretrain.readthedocs.io/en/latest/notes/changelog.html) for more details and other release history.
+
+## Installation
+
+Below are quick steps for installation:
+
+```shell
+conda create -n open-mmlab python=3.8 pytorch==1.10.1 torchvision==0.11.2 cudatoolkit=11.3 -c pytorch -y
+conda activate open-mmlab
+pip install openmim
+git clone https://github.com/open-mmlab/mmpretrain.git
+cd mmpretrain
+mim install -e .
+```
+
+Please refer to the [installation documentation](https://mmpretrain.readthedocs.io/en/latest/get_started.html) for more detailed installation and dataset preparation.
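To check that the installation works, you can run a single inference through the Python API. The snippet below is a minimal sketch based on the inference interface described in the documentation; it assumes network access so the `resnet18_8xb32_in1k` checkpoint can be downloaded on first use, and it uses the `demo/demo.JPEG` image shipped with the cloned repository.

```python
# Quick smoke test of the installation (a sketch; the pretrained
# checkpoint is downloaded automatically on first use).
from mmpretrain import inference_model

# 'resnet18_8xb32_in1k' is one of the config names from the model zoo,
# and demo/demo.JPEG ships with the repository.
result = inference_model('resnet18_8xb32_in1k', 'demo/demo.JPEG')
print(result['pred_class'], result['pred_score'])
```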
+ +For multi-modality models support, please install the extra dependencies by: + +```shell +mim install -e ".[multimodal]" +``` + +## User Guides + +We provided a series of tutorials about the basic usage of MMPreTrain for new users: + +- [Learn about Configs](https://mmpretrain.readthedocs.io/en/latest/user_guides/config.html) +- [Prepare Dataset](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html) +- [Inference with existing models](https://mmpretrain.readthedocs.io/en/latest/user_guides/inference.html) +- [Train](https://mmpretrain.readthedocs.io/en/latest/user_guides/train.html) +- [Test](https://mmpretrain.readthedocs.io/en/latest/user_guides/test.html) +- [Downstream tasks](https://mmpretrain.readthedocs.io/en/latest/user_guides/downstream.html) + +For more information, please refer to [our documentation](https://mmpretrain.readthedocs.io/en/latest/). + +## Model zoo + +Results and models are available in the [model zoo](https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html). + +
+ Overview +
+ + + + + + + + + + + + + + +
+ Supported Backbones + + Self-supervised Learning + + Multi-Modality Algorithms + + Others +
+ + + + + + + Image Retrieval Task: + + Training&Test Tips: + +
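The tables above can also be browsed programmatically. As a rough sketch (assuming MMPreTrain is installed as described in the Installation section), `list_models` filters the registered config names by task and pattern, and `get_model` builds a model from a config name, optionally loading its pretrained weights. The `convnext-base_32xb128_in1k` name below is only an illustrative example; any name returned by `list_models` can be used.

```python
from mmpretrain import get_model, list_models

# List ConvNeXt configs registered for image classification.
print(list_models(task='Image Classification', pattern='convnext'))

# Build one of them and load the matching pretrained weights
# (example name; substitute any name returned by list_models).
model = get_model('convnext-base_32xb128_in1k', pretrained=True)
print(type(model))
```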
+
+## Contributing
+
+We appreciate all contributions to improve MMPreTrain.
+Please refer to [CONTRIBUTING](https://mmpretrain.readthedocs.io/en/latest/notes/contribution_guide.html) for the contributing guidelines.
+
+## Acknowledgement
+
+MMPreTrain is an open source project that is contributed by researchers and engineers from various colleges and companies. We appreciate all the contributors who implement their methods or add new features, as well as users who give valuable feedback.
+We hope that the toolbox and benchmark can serve the growing research community by providing a flexible toolkit to reimplement existing methods and support their own academic research.
+
+## Citation
+
+If you find this project useful in your research, please consider citing:
+
+```BibTeX
+@misc{2023mmpretrain,
+    title={OpenMMLab's Pre-training Toolbox and Benchmark},
+    author={MMPreTrain Contributors},
+    howpublished = {\url{https://github.com/open-mmlab/mmpretrain}},
+    year={2023}
+}
+```
+
+## License
+
+This project is released under the [Apache 2.0 license](LICENSE).
+
+## Projects in OpenMMLab
+
+- [MMEngine](https://github.com/open-mmlab/mmengine): OpenMMLab foundational library for training deep learning models.
+- [MMCV](https://github.com/open-mmlab/mmcv): OpenMMLab foundational library for computer vision.
+- [MIM](https://github.com/open-mmlab/mim): MIM installs OpenMMLab packages.
+- [MMEval](https://github.com/open-mmlab/mmeval): A unified evaluation library for multiple machine learning libraries.
+- [MMPreTrain](https://github.com/open-mmlab/mmpretrain): OpenMMLab pre-training toolbox and benchmark.
+- [MMDetection](https://github.com/open-mmlab/mmdetection): OpenMMLab detection toolbox and benchmark.
+- [MMDetection3D](https://github.com/open-mmlab/mmdetection3d): OpenMMLab's next-generation platform for general 3D object detection.
+- [MMRotate](https://github.com/open-mmlab/mmrotate): OpenMMLab rotated object detection toolbox and benchmark.
+- [MMYOLO](https://github.com/open-mmlab/mmyolo): OpenMMLab YOLO series toolbox and benchmark.
+- [MMSegmentation](https://github.com/open-mmlab/mmsegmentation): OpenMMLab semantic segmentation toolbox and benchmark.
+- [MMOCR](https://github.com/open-mmlab/mmocr): OpenMMLab text detection, recognition, and understanding toolbox.
+- [MMPose](https://github.com/open-mmlab/mmpose): OpenMMLab pose estimation toolbox and benchmark.
+- [MMHuman3D](https://github.com/open-mmlab/mmhuman3d): OpenMMLab 3D human parametric model toolbox and benchmark.
+- [MMSelfSup](https://github.com/open-mmlab/mmselfsup): OpenMMLab self-supervised learning toolbox and benchmark.
+- [MMRazor](https://github.com/open-mmlab/mmrazor): OpenMMLab model compression toolbox and benchmark.
+- [MMFewShot](https://github.com/open-mmlab/mmfewshot): OpenMMLab few-shot learning toolbox and benchmark.
+- [MMAction2](https://github.com/open-mmlab/mmaction2): OpenMMLab's next-generation action understanding toolbox and benchmark.
+- [MMTracking](https://github.com/open-mmlab/mmtracking): OpenMMLab video perception toolbox and benchmark.
+- [MMFlow](https://github.com/open-mmlab/mmflow): OpenMMLab optical flow toolbox and benchmark.
+- [MMagic](https://github.com/open-mmlab/mmagic): Open**MM**Lab **A**dvanced, **G**enerative and **I**ntelligent **C**reation toolbox.
+- [MMGeneration](https://github.com/open-mmlab/mmgeneration): OpenMMLab image and video generative models toolbox.
+- [MMDeploy](https://github.com/open-mmlab/mmdeploy): OpenMMLab model deployment framework.
+- [Playground](https://github.com/open-mmlab/playground): A central hub for gathering and showcasing amazing projects built upon OpenMMLab. diff --git a/README_zh-CN.md b/README_zh-CN.md new file mode 100644 index 0000000000000000000000000000000000000000..9ee8dffc401d414c0c2b7135ba2a4887f80608a4 --- /dev/null +++ b/README_zh-CN.md @@ -0,0 +1,353 @@ +
+ + +
 
+
+ OpenMMLab 官网 + + + HOT + + +      + OpenMMLab 开放平台 + + + TRY IT OUT + + +
+
 
+ +[![PyPI](https://img.shields.io/pypi/v/mmpretrain)](https://pypi.org/project/mmpretrain) +[![Docs](https://img.shields.io/badge/docs-latest-blue)](https://mmpretrain.readthedocs.io/zh_CN/latest/) +[![Build Status](https://github.com/open-mmlab/mmpretrain/workflows/build/badge.svg)](https://github.com/open-mmlab/mmpretrain/actions) +[![codecov](https://codecov.io/gh/open-mmlab/mmpretrain/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/mmpretrain) +[![license](https://img.shields.io/github/license/open-mmlab/mmpretrain.svg)](https://github.com/open-mmlab/mmpretrain/blob/main/LICENSE) +[![open issues](https://isitmaintained.com/badge/open/open-mmlab/mmpretrain.svg)](https://github.com/open-mmlab/mmpretrain/issues) +[![issue resolution](https://isitmaintained.com/badge/resolution/open-mmlab/mmpretrain.svg)](https://github.com/open-mmlab/mmpretrain/issues) + +[📘 中文文档](https://mmpretrain.readthedocs.io/zh_CN/latest/) | +[🛠️ 安装教程](https://mmpretrain.readthedocs.io/zh_CN/latest/get_started.html) | +[👀 模型库](https://mmpretrain.readthedocs.io/zh_CN/latest/modelzoo_statistics.html) | +[🆕 更新日志](https://mmpretrain.readthedocs.io/zh_CN/latest/notes/changelog.html) | +[🤔 报告问题](https://github.com/open-mmlab/mmpretrain/issues/new/choose) + + + +[English](/README.md) | 简体中文 + +
+ +
+ + + + + + + + + + + + + + + + + +
+ +## Introduction + +MMPreTrain 是一款基于 PyTorch 的开源深度学习预训练工具箱,是 [OpenMMLab](https://openmmlab.com/) 项目的成员之一 + +`主分支`代码目前支持 PyTorch 1.8 以上的版本。 + +### 主要特性 + +- 支持多样的主干网络与预训练模型 +- 支持多种训练策略(有监督学习,无监督学习,多模态学习等) +- 提供多种训练技巧 +- 大量的训练配置文件 +- 高效率和高可扩展性 +- 功能强大的工具箱,有助于模型分析和实验 +- 支持多种开箱即用的推理任务 + - 图像分类 + - 图像描述(Image Caption) + - 视觉问答(Visual Question Answering) + - 视觉定位(Visual Grounding) + - 检索(图搜图,图搜文,文搜图) + +https://github.com/open-mmlab/mmpretrain/assets/26739999/e4dcd3a2-f895-4d1b-a351-fbc74a04e904 + +## 更新日志 + +🌟 2024/01/04 发布了 v1.2.0 版本 + +- 支持了 LLaVA 1.5 +- 实现了一个 RAM 模型的 gradio 推理例程 + +🌟 2023/10/12 发布了 v1.1.0 版本 + +- 支持 Mini-GPT4 训练并提供一个基于 Baichuan-7B 的中文模型 +- 支持基于 CLIP 的零样本分类。 + +🌟 2023/7/4 发布了 v1.0.0 版本 + +- 支持更多**多模态**算法的推理, 例如 [**LLaVA**](./configs/llava/), [**MiniGPT-4**](./configs/minigpt4), [**Otter**](./configs/otter/) 等。 +- 支持约 **10 个多模态**数据集! +- 添加自监督学习算法 [**iTPN**](./configs/itpn/), [**SparK**](./configs/spark/)。 +- 提供[新配置文件](./mmpretrain/configs/)和 [DeepSpeed/FSDP](./configs/mae/benchmarks/) 的样例。这是[新配置文件](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#a-pure-python-style-configuration-file-beta) 和 [DeepSpeed/FSDP with FlexibleRunner](https://mmengine.readthedocs.io/en/latest/api/generated/mmengine.runner.FlexibleRunner.html#mmengine.runner.FlexibleRunner) 的文档链接。 + +🌟 从 MMClassification 升级到 MMPreTrain + +- 整合来自 MMSelfSup 的自监督学习算法,例如 `MAE`, `BEiT` 等 +- 支持了 **RIFormer**,简单但有效的视觉主干网络,却移除了 token mixer +- 重构数据管道可视化 +- 支持了 **LeViT**, **XCiT**, **ViG**, **ConvNeXt-V2**, **EVA**, **RevViT**, **EfficientnetV2**, **CLIP**, **TinyViT** 和 **MixMIM** 等骨干网络结构 + +这个版本引入一个全新的,可扩展性强的训练和测试引擎,但目前仍在开发中。欢迎根据 [文档](https://mmpretrain.readthedocs.io/zh_CN/latest/) 进行试用。 + +同时,新版本中存在一些与旧版本不兼容的修改。请查看 [迁移文档](https://mmpretrain.readthedocs.io/zh_CN/latest/migration.html) 来详细了解这些变动。 + +发布历史和更新细节请参考 [更新日志](https://mmpretrain.readthedocs.io/zh_CN/latest/notes/changelog.html)。 + +## 安装 + +以下是安装的简要步骤: + +```shell +conda create -n open-mmlab python=3.8 pytorch==1.10.1 torchvision==0.11.2 cudatoolkit=11.3 -c pytorch -y +conda activate open-mmlab +pip3 install openmim +git clone https://github.com/open-mmlab/mmpretrain.git +cd mmpretrain +mim install -e . +``` + +更详细的步骤请参考 [安装指南](https://mmpretrain.readthedocs.io/zh_CN/latest/get_started.html) 进行安装。 + +如果需要多模态模型,请使用如下方式安装额外的依赖: + +```shell +mim install -e ".[multimodal]" +``` + +## 基础教程 + +我们为新用户提供了一系列基础教程: + +- [学习配置文件](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/config.html) +- [准备数据集](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/dataset_prepare.html) +- [使用现有模型推理](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/inference.html) +- [训练](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/train.html) +- [测试](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/test.html) +- [下游任务](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/downstream.html) + +关于更多的信息,请查阅我们的 [相关文档](https://mmpretrain.readthedocs.io/zh_CN/latest/)。 + +## 模型库 + +相关结果和模型可在 [模型库](https://mmpretrain.readthedocs.io/zh_CN/latest/modelzoo_statistics.html) 中获得。 + +
+ 概览 +
+ + + + + + + + + + + + + + +
+ 支持的主干网络 + + 自监督学习 + + 多模态算法 + + 其它 +
+ + + + + + + 图像检索任务: + + 训练和测试 Tips: + +
+ +## 参与贡献 + +我们非常欢迎任何有助于提升 MMPreTrain 的贡献,请参考 [贡献指南](https://mmpretrain.readthedocs.io/zh_CN/latest/notes/contribution_guide.html) 来了解如何参与贡献。 + +## 致谢 + +MMPreTrain 是一款由不同学校和公司共同贡献的开源项目。我们感谢所有为项目提供算法复现和新功能支持的贡献者,以及提供宝贵反馈的用户。 +我们希望该工具箱和基准测试可以为社区提供灵活的代码工具,供用户复现现有算法并开发自己的新模型,从而不断为开源社区提供贡献。 + +## 引用 + +如果你在研究中使用了本项目的代码或者性能基准,请参考如下 bibtex 引用 MMPreTrain。 + +```BibTeX +@misc{2023mmpretrain, + title={OpenMMLab's Pre-training Toolbox and Benchmark}, + author={MMPreTrain Contributors}, + howpublished = {\url{https://github.com/open-mmlab/mmpretrain}}, + year={2023} +} +``` + +## 许可证 + +该项目开源自 [Apache 2.0 license](LICENSE). + +## OpenMMLab 的其他项目 + +- [MMEngine](https://github.com/open-mmlab/mmengine): OpenMMLab 深度学习模型训练基础库 +- [MMCV](https://github.com/open-mmlab/mmcv): OpenMMLab 计算机视觉基础库 +- [MIM](https://github.com/open-mmlab/mim): MIM 是 OpenMMlab 项目、算法、模型的统一入口 +- [MMEval](https://github.com/open-mmlab/mmeval): 统一开放的跨框架算法评测库 +- [MMPreTrain](https://github.com/open-mmlab/mmpretrain): OpenMMLab 深度学习预训练工具箱 +- [MMDetection](https://github.com/open-mmlab/mmdetection): OpenMMLab 目标检测工具箱 +- [MMDetection3D](https://github.com/open-mmlab/mmdetection3d): OpenMMLab 新一代通用 3D 目标检测平台 +- [MMRotate](https://github.com/open-mmlab/mmrotate): OpenMMLab 旋转框检测工具箱与测试基准 +- [MMYOLO](https://github.com/open-mmlab/mmyolo): OpenMMLab YOLO 系列工具箱与测试基准 +- [MMSegmentation](https://github.com/open-mmlab/mmsegmentation): OpenMMLab 语义分割工具箱 +- [MMOCR](https://github.com/open-mmlab/mmocr): OpenMMLab 全流程文字检测识别理解工具包 +- [MMPose](https://github.com/open-mmlab/mmpose): OpenMMLab 姿态估计工具箱 +- [MMHuman3D](https://github.com/open-mmlab/mmhuman3d): OpenMMLab 人体参数化模型工具箱与测试基准 +- [MMSelfSup](https://github.com/open-mmlab/mmselfsup): OpenMMLab 自监督学习工具箱与测试基准 +- [MMRazor](https://github.com/open-mmlab/mmrazor): OpenMMLab 模型压缩工具箱与测试基准 +- [MMFewShot](https://github.com/open-mmlab/mmfewshot): OpenMMLab 少样本学习工具箱与测试基准 +- [MMAction2](https://github.com/open-mmlab/mmaction2): OpenMMLab 新一代视频理解工具箱 +- [MMTracking](https://github.com/open-mmlab/mmtracking): OpenMMLab 一体化视频目标感知平台 +- [MMFlow](https://github.com/open-mmlab/mmflow): OpenMMLab 光流估计工具箱与测试基准 +- [MMagic](https://github.com/open-mmlab/mmagic): OpenMMLab 新一代人工智能内容生成(AIGC)工具箱 +- [MMGeneration](https://github.com/open-mmlab/mmgeneration): OpenMMLab 图片视频生成模型工具箱 +- [MMDeploy](https://github.com/open-mmlab/mmdeploy): OpenMMLab 模型部署框架 +- [Playground](https://github.com/open-mmlab/playground): 收集和展示 OpenMMLab 相关的前沿、有趣的社区项目 + +## 欢迎加入 OpenMMLab 社区 + +扫描下方的二维码可关注 OpenMMLab 团队的 [知乎官方账号](https://www.zhihu.com/people/openmmlab),扫描下方微信二维码添加喵喵好友,进入 MMPretrain 微信交流社群。【加好友申请格式:研究方向+地区+学校/公司+姓名】 + +
+ +
+ +我们会在 OpenMMLab 社区为大家 + +- 📢 分享 AI 框架的前沿核心技术 +- 💻 解读 PyTorch 常用模块源码 +- 📰 发布 OpenMMLab 的相关新闻 +- 🚀 介绍 OpenMMLab 开发的前沿算法 +- 🏃 获取更高效的问题答疑和意见反馈 +- 🔥 提供与各行各业开发者充分交流的平台 + +干货满满 📘,等你来撩 💗,OpenMMLab 社区期待您的加入 👬 diff --git a/configs/_base_/datasets/cifar100_bs16.py b/configs/_base_/datasets/cifar100_bs16.py new file mode 100644 index 0000000000000000000000000000000000000000..67477db0367fa1356c4514a46f4b43d56b4c5822 --- /dev/null +++ b/configs/_base_/datasets/cifar100_bs16.py @@ -0,0 +1,45 @@ +# dataset settings +dataset_type = 'CIFAR100' +data_preprocessor = dict( + num_classes=100, + # RGB format normalization parameters + mean=[129.304, 124.070, 112.434], + std=[68.170, 65.392, 70.418], + # loaded images are already RGB format + to_rgb=False) + +train_pipeline = [ + dict(type='RandomCrop', crop_size=32, padding=4), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=16, + num_workers=2, + dataset=dict( + type=dataset_type, + data_root='data/cifar100', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=16, + num_workers=2, + dataset=dict( + type=dataset_type, + data_root='data/cifar100/', + split='test', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, )) + +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/cifar10_bs16.py b/configs/_base_/datasets/cifar10_bs16.py new file mode 100644 index 0000000000000000000000000000000000000000..408be35da845a39bf7058eb9c3ce5549295b3822 --- /dev/null +++ b/configs/_base_/datasets/cifar10_bs16.py @@ -0,0 +1,45 @@ +# dataset settings +dataset_type = 'CIFAR10' +data_preprocessor = dict( + num_classes=10, + # RGB format normalization parameters + mean=[125.307, 122.961, 113.8575], + std=[51.5865, 50.847, 51.255], + # loaded images are already RGB format + to_rgb=False) + +train_pipeline = [ + dict(type='RandomCrop', crop_size=32, padding=4), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=16, + num_workers=2, + dataset=dict( + type=dataset_type, + data_root='data/cifar10', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=16, + num_workers=2, + dataset=dict( + type=dataset_type, + data_root='data/cifar10/', + split='test', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, )) + +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/coco_caption.py b/configs/_base_/datasets/coco_caption.py new file mode 100644 index 0000000000000000000000000000000000000000..5346111273d4120581fe854583c99f6b94e7e873 --- /dev/null +++ b/configs/_base_/datasets/coco_caption.py @@ -0,0 +1,70 @@ +# data settings +# coco caption annotations can be grabbed from LAVIS repo +# https://github.com/salesforce/LAVIS/blob/main/lavis/configs/datasets/coco/defaults_cap.yaml +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + 
dict( + type='RandomResizedCrop', + scale=384, + interpolation='bicubic', + backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='CleanCaption', keys='gt_caption'), + dict( + type='PackInputs', + algorithm_keys=['gt_caption'], + meta_keys=['image_id'], + ), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(384, 384), + interpolation='bicubic', + backend='pillow'), + dict(type='PackInputs', meta_keys=['image_id']), +] + +train_dataloader = dict( + batch_size=32, + num_workers=5, + dataset=dict( + type='COCOCaption', + data_root='data/coco', + ann_file='annotations/coco_karpathy_train.json', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + persistent_workers=True, + drop_last=True, +) + +val_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type='COCOCaption', + data_root='data/coco', + ann_file='annotations/coco_karpathy_val.json', + pipeline=test_pipeline, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) + +val_evaluator = dict( + type='COCOCaption', + ann_file='data/coco/annotations/coco_karpathy_val_gt.json', +) + +# # If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/coco_okvqa.py b/configs/_base_/datasets/coco_okvqa.py new file mode 100644 index 0000000000000000000000000000000000000000..16f1577dbb5e5c7c14186f2523e94e0aeffc4b54 --- /dev/null +++ b/configs/_base_/datasets/coco_okvqa.py @@ -0,0 +1,75 @@ +# data settings + +data_preprocessor = dict( + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + interpolation='bicubic', + backend='pillow'), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=['question_id', 'image_id'], + ), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(480, 480), + interpolation='bicubic', + backend='pillow'), + dict( + type='CleanCaption', + keys=['question'], + ), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=['question_id', 'image_id'], + ), +] + +train_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='COCOVQA', + data_root='data/coco', + data_prefix='train2014', + question_file= + 'annotations/okvqa_OpenEnded_mscoco_train2014_questions.json', + ann_file='annotations/okvqa_mscoco_train2014_annotations.json', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + persistent_workers=True, + drop_last=True, +) + +val_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='COCOVQA', + data_root='data/coco', + data_prefix='val2014', + question_file= + 'annotations/okvqa_OpenEnded_mscoco_val2014_questions.json', + ann_file='annotations/okvqa_mscoco_val2014_annotations.json', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) + +val_evaluator = dict(type='VQAAcc') + +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/coco_retrieval.py b/configs/_base_/datasets/coco_retrieval.py new file mode 100644 index 
0000000000000000000000000000000000000000..6f6b802a3854fd029c476d78296edbc9bffd4e75 --- /dev/null +++ b/configs/_base_/datasets/coco_retrieval.py @@ -0,0 +1,99 @@ +# data settings +# Here are the links to download the annotations for coco retrieval for conveniency # noqa +# https://download.openmmlab.com/mmclassification/datasets/coco_retrieval/caption_karpathy_train2014.json +# https://download.openmmlab.com/mmclassification/datasets/coco_retrieval/caption_karpathy_val2014.json +# https://download.openmmlab.com/mmclassification/datasets/coco_retrieval/caption_karpathy_test2014.json +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +rand_increasing_policies = [ + dict(type='AutoContrast'), + dict(type='Equalize'), + dict(type='Rotate', magnitude_key='angle', magnitude_range=(0, 30)), + dict( + type='Brightness', magnitude_key='magnitude', + magnitude_range=(0, 0.0)), + dict(type='Sharpness', magnitude_key='magnitude', magnitude_range=(0, 0)), + dict( + type='Shear', + magnitude_key='magnitude', + magnitude_range=(0, 0.3), + direction='horizontal'), + dict( + type='Shear', + magnitude_key='magnitude', + magnitude_range=(0, 0.3), + direction='vertical'), +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + crop_ratio_range=(0.5, 1.0), + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies=rand_increasing_policies, + num_policies=2, + magnitude_level=5), + dict(type='CleanCaption', keys='text'), + dict( + type='PackInputs', + algorithm_keys=['text', 'is_matched'], + meta_keys=['image_id']), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(384, 384), + interpolation='bicubic', + backend='pillow'), + dict(type='CleanCaption', keys='text'), + dict( + type='PackInputs', + algorithm_keys=['text', 'gt_text_id', 'gt_image_id'], + meta_keys=['image_id']), +] + +train_dataloader = dict( + batch_size=32, + num_workers=16, + dataset=dict( + type='COCORetrieval', + data_root='data/coco', + ann_file='annotations/caption_karpathy_train2014.json', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + persistent_workers=True, + drop_last=True, +) + +val_dataloader = dict( + batch_size=64, + num_workers=16, + dataset=dict( + type='COCORetrieval', + data_root='data/coco', + ann_file='annotations/caption_karpathy_val2014.json', + pipeline=test_pipeline, + # This is required for evaluation + test_mode=True, + ), + sampler=dict(type='SequentialSampler', subsample_type='sequential'), + persistent_workers=True, +) + +val_evaluator = dict(type='RetrievalRecall', topk=(1, 5, 10)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/coco_vg_vqa.py b/configs/_base_/datasets/coco_vg_vqa.py new file mode 100644 index 0000000000000000000000000000000000000000..7ba0eac46853c1a477e2c6b2bc3dcddbbf7e5423 --- /dev/null +++ b/configs/_base_/datasets/coco_vg_vqa.py @@ -0,0 +1,96 @@ +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=(480, 480), + 
crop_ratio_range=(0.5, 1.0), + interpolation='bicubic', + backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='simple_increasing', # slightly different from LAVIS + num_policies=2, + magnitude_level=5), + dict(type='CleanCaption', keys=['question', 'gt_answer']), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight']), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(480, 480), + interpolation='bicubic', + backend='pillow'), + dict(type='CleanCaption', keys=['question']), + dict( + type='PackInputs', + algorithm_keys=['question'], + meta_keys=['question_id']), +] + +train_dataloader = dict( + batch_size=32, + num_workers=8, + dataset=dict( + type='ConcatDataset', + datasets=[ + # VQAv2 train + dict( + type='COCOVQA', + data_root='data/coco', + data_prefix='train2014', + question_file= + 'annotations/v2_OpenEnded_mscoco_train2014_questions.json', + ann_file='annotations/v2_mscoco_train2014_annotations.json', + pipeline=train_pipeline, + ), + # VQAv2 val + dict( + type='COCOVQA', + data_root='data/coco', + data_prefix='val2014', + question_file= + 'annotations/v2_OpenEnded_mscoco_val2014_questions.json', + ann_file='annotations/v2_mscoco_val2014_annotations.json', + pipeline=train_pipeline, + ), + # Visual Genome + dict( + type='VisualGenomeQA', + data_root='visual_genome', + data_prefix='image', + ann_file='question_answers.json', + pipeline=train_pipeline, + ) + ]), + sampler=dict(type='DefaultSampler', shuffle=True), + persistent_workers=True, + drop_last=True, +) + +test_dataloader = dict( + batch_size=32, + num_workers=8, + dataset=dict( + type='COCOVQA', + data_root='data/coco', + data_prefix='test2015', + question_file= + 'annotations/v2_OpenEnded_mscoco_test2015_questions.json', # noqa: E501 + pipeline=test_pipeline, + ), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = dict(type='ReportVQA', file_path='vqa_test.json') diff --git a/configs/_base_/datasets/coco_vqa.py b/configs/_base_/datasets/coco_vqa.py new file mode 100644 index 0000000000000000000000000000000000000000..7fb16bd241b357a897b168ceff5450b6e7f2dc80 --- /dev/null +++ b/configs/_base_/datasets/coco_vqa.py @@ -0,0 +1,84 @@ +# data settings + +data_preprocessor = dict( + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + interpolation='bicubic', + backend='pillow'), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=['question_id', 'image_id'], + ), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(480, 480), + interpolation='bicubic', + backend='pillow'), + dict( + type='CleanCaption', + keys=['question'], + ), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=['question_id', 'image_id'], + ), +] + +train_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='COCOVQA', + data_root='data/coco', + data_prefix='train2014', + question_file= + 'annotations/v2_OpenEnded_mscoco_train2014_questions.json', + ann_file='annotations/v2_mscoco_train2014_annotations.json', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + persistent_workers=True, + drop_last=True, +) + +val_dataloader = dict( + 
batch_size=16, + num_workers=8, + dataset=dict( + type='COCOVQA', + data_root='data/coco', + data_prefix='val2014', + question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json', + ann_file='annotations/v2_mscoco_val2014_annotations.json', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +val_evaluator = dict(type='VQAAcc') + +test_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='COCOVQA', + data_root='data/coco', + data_prefix='test2015', + question_file= # noqa: E251 + 'annotations/v2_OpenEnded_mscoco_test2015_questions.json', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = dict(type='ReportVQA', file_path='vqa_test.json') diff --git a/configs/_base_/datasets/cub_bs8_384.py b/configs/_base_/datasets/cub_bs8_384.py new file mode 100644 index 0000000000000000000000000000000000000000..24b3a9ffd4df6987716f15a42cc2e3d02c436b90 --- /dev/null +++ b/configs/_base_/datasets/cub_bs8_384.py @@ -0,0 +1,51 @@ +# dataset settings +dataset_type = 'CUB' +data_preprocessor = dict( + num_classes=200, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=510), + dict(type='RandomCrop', crop_size=384), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=510), + dict(type='CenterCrop', crop_size=384), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=8, + num_workers=2, + dataset=dict( + type=dataset_type, + data_root='data/CUB_200_2011', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=8, + num_workers=2, + dataset=dict( + type=dataset_type, + data_root='data/CUB_200_2011', + split='test', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, )) + +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/cub_bs8_448.py b/configs/_base_/datasets/cub_bs8_448.py new file mode 100644 index 0000000000000000000000000000000000000000..c0bc7b7e1fbd308763c68e1b6302669c705e8f41 --- /dev/null +++ b/configs/_base_/datasets/cub_bs8_448.py @@ -0,0 +1,50 @@ +# dataset settings +dataset_type = 'CUB' +data_preprocessor = dict( + num_classes=200, + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=600), + dict(type='RandomCrop', crop_size=448), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=600), + dict(type='CenterCrop', crop_size=448), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=8, + num_workers=2, + dataset=dict( + type=dataset_type, + data_root='data/CUB_200_2011', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=8, + num_workers=2, + dataset=dict( + type=dataset_type, + data_root='data/CUB_200_2011', + split='test', + pipeline=test_pipeline), + 
sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, )) + +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/flickr30k_caption.py b/configs/_base_/datasets/flickr30k_caption.py new file mode 100644 index 0000000000000000000000000000000000000000..a902b5291f1df0df719f570538385a1c75dfccfd --- /dev/null +++ b/configs/_base_/datasets/flickr30k_caption.py @@ -0,0 +1,92 @@ +# data settings + +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + interpolation='bicubic', + backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='CleanCaption', keys='gt_caption'), + dict( + type='PackInputs', + algorithm_keys=['gt_caption'], + meta_keys=['image_id'], + ), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(384, 384), + interpolation='bicubic', + backend='pillow'), + dict(type='PackInputs', meta_keys=['image_id']), +] + +train_dataloader = dict( + batch_size=32, + num_workers=5, + dataset=dict( + type='Flickr30kCaption', + data_root='data/flickr30k', + ann_file='annotations/dataset_flickr30k.json', + data_prefix='images', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + persistent_workers=True, + drop_last=True, +) + +val_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type='Flickr30kCaption', + data_root='data/flickr30k', + ann_file='annotations/dataset_flickr30k.json', + data_prefix='images', + split='val', + pipeline=test_pipeline, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) + +# refer tools/dataset_converters/convert_flickr30k_ann.py +val_evaluator = dict( + type='COCOCaption', + ann_file='data/flickr30k_val_gt.json', +) + +# # If you want standard test, please manually configure the test dataset +test_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type='Flickr30kCaption', + data_root='data/flickr30k', + ann_file='annotations/dataset_flickr30k.json', + data_prefix='images', + split='test', + pipeline=test_pipeline, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) + +# refer tools/dataset_converters/convert_flickr30k_ann.py +test_evaluator = dict( + type='COCOCaption', + ann_file='data/flickr30k_test_gt.json', +) diff --git a/configs/_base_/datasets/flickr30k_retrieval.py b/configs/_base_/datasets/flickr30k_retrieval.py new file mode 100644 index 0000000000000000000000000000000000000000..acbc645b92214599d77cd9f3ecc70e9b7235b8e5 --- /dev/null +++ b/configs/_base_/datasets/flickr30k_retrieval.py @@ -0,0 +1,112 @@ +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +rand_increasing_policies = [ + dict(type='AutoContrast'), + dict(type='Equalize'), + dict(type='Rotate', magnitude_key='angle', magnitude_range=(0, 30)), + dict( + type='Brightness', magnitude_key='magnitude', + magnitude_range=(0, 0.0)), + dict(type='Sharpness', magnitude_key='magnitude', magnitude_range=(0, 0)), + dict( + type='Shear', + magnitude_key='magnitude', + magnitude_range=(0, 0.3), + direction='horizontal'), + 
dict( + type='Shear', + magnitude_key='magnitude', + magnitude_range=(0, 0.3), + direction='vertical'), +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + crop_ratio_range=(0.5, 1.0), + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies=rand_increasing_policies, + num_policies=2, + magnitude_level=5), + dict(type='CleanCaption', keys='text'), + dict( + type='PackInputs', + algorithm_keys=['text', 'is_matched'], + meta_keys=['image_id']), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(384, 384), + interpolation='bicubic', + backend='pillow'), + dict(type='CleanCaption', keys='text'), + dict( + type='PackInputs', + algorithm_keys=['text', 'gt_text_id', 'gt_image_id'], + meta_keys=['image_id']), +] + +train_dataloader = dict( + batch_size=32, + num_workers=16, + dataset=dict( + type='Flickr30kRetrieval', + data_root='data/flickr30k', + ann_file='annotations/dataset_flickr30k.json', + data_prefix='images', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + persistent_workers=True, + drop_last=True, +) + +val_dataloader = dict( + batch_size=64, + num_workers=16, + dataset=dict( + type='Flickr30kRetrieval', + data_root='data/flickr30k', + ann_file='annotations/dataset_flickr30k.json', + data_prefix='images', + split='val', + pipeline=test_pipeline, + test_mode=True, # This is required for evaluation + ), + sampler=dict(type='SequentialSampler', subsample_type='sequential'), + persistent_workers=True, +) + +val_evaluator = dict(type='RetrievalRecall', topk=(1, 5, 10)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = dict( + batch_size=64, + num_workers=16, + dataset=dict( + type='Flickr30kRetrieval', + data_root='data/flickr30k', + ann_file='annotations/dataset_flickr30k.json', + data_prefix='images', + split='test', + pipeline=test_pipeline, + test_mode=True, # This is required for evaluation + ), + sampler=dict(type='SequentialSampler', subsample_type='sequential'), + persistent_workers=True, +) +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/gqa.py b/configs/_base_/datasets/gqa.py new file mode 100644 index 0000000000000000000000000000000000000000..872ab451f32dd9cff87890c943a5ed1dc7ecb517 --- /dev/null +++ b/configs/_base_/datasets/gqa.py @@ -0,0 +1,81 @@ +# data settings + +data_preprocessor = dict( + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + interpolation='bicubic', + backend='pillow'), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=['question_id', 'image_id'], + ), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(480, 480), + interpolation='bicubic', + backend='pillow'), + dict( + type='CleanCaption', + keys=['question'], + ), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=['question_id', 'image_id'], + ), +] + +train_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='GQA', + data_root='data/gqa', + data_prefix='images', + ann_file='annotations/train_balanced_questions.json', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', 
shuffle=False), + persistent_workers=True, + drop_last=True, +) + +val_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='GQA', + data_root='data/gqa', + data_prefix='images', + ann_file='annotations/testdev_balanced_questions.json', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +val_evaluator = dict(type='GQAAcc') + +test_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='GQA', + data_root='data/gqa', + data_prefix='images', + ann_file='annotations/testdev_balanced_questions.json', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet21k_bs128.py b/configs/_base_/datasets/imagenet21k_bs128.py new file mode 100644 index 0000000000000000000000000000000000000000..38bfd351bf8f49ae18d21492c6fc656a7b2ecc45 --- /dev/null +++ b/configs/_base_/datasets/imagenet21k_bs128.py @@ -0,0 +1,28 @@ +# dataset settings +dataset_type = 'ImageNet21k' +data_preprocessor = dict( + num_classes=21842, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=128, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet21k', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) diff --git a/configs/_base_/datasets/imagenet_bs128_mbv3.py b/configs/_base_/datasets/imagenet_bs128_mbv3.py new file mode 100644 index 0000000000000000000000000000000000000000..d355f507bf8e2be5d9efc3cc777e9854196b9d64 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs128_mbv3.py @@ -0,0 +1,66 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='AutoAugment', + policies='imagenet', + hparams=dict(pad_val=[round(x) for x in bgr_mean])), + dict( + type='RandomErasing', + erase_prob=0.2, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=128, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=128, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If 
you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs128_poolformer_medium_224.py b/configs/_base_/datasets/imagenet_bs128_poolformer_medium_224.py new file mode 100644 index 0000000000000000000000000000000000000000..be90a655674e22c3341c185c7be5532b1bef8cf1 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs128_poolformer_medium_224.py @@ -0,0 +1,80 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=236, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=128, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=128, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs128_poolformer_small_224.py b/configs/_base_/datasets/imagenet_bs128_poolformer_small_224.py new file mode 100644 index 0000000000000000000000000000000000000000..c9e0f071ade1feccf6a3f96ef7ad8f28c693e84c --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs128_poolformer_small_224.py @@ -0,0 +1,80 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + 
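+        # erased patches cover between 2% and one third of the image area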
max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=248, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=128, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=128, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs128_revvit_224.py b/configs/_base_/datasets/imagenet_bs128_revvit_224.py new file mode 100644 index 0000000000000000000000000000000000000000..fd87aaf033b08dd94b5a684eed759072ff6fd4e9 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs128_revvit_224.py @@ -0,0 +1,83 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=7, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', # should be 'pixel', but currently not supported + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=256, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + persistent_workers=True, +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs128_riformer_medium_384.py b/configs/_base_/datasets/imagenet_bs128_riformer_medium_384.py new file mode 100644 index 
0000000000000000000000000000000000000000..151ded7895b378ba7e6bf5895fb11d903841b95d --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs128_riformer_medium_384.py @@ -0,0 +1,80 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=404, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=384), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=128, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs128_riformer_small_384.py b/configs/_base_/datasets/imagenet_bs128_riformer_small_384.py new file mode 100644 index 0000000000000000000000000000000000000000..ea9799ba9c41fcbaf049a54d9776750c860a598c --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs128_riformer_small_384.py @@ -0,0 +1,80 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=426, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=384), + 
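+    # PackInputs bundles the processed image and its meta info into the data sample fed to the model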
dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=128, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=32, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs128_vig_224.py b/configs/_base_/datasets/imagenet_bs128_vig_224.py new file mode 100644 index 0000000000000000000000000000000000000000..abb0182a6ce53202bee905bcd3849b851852b4b4 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs128_vig_224.py @@ -0,0 +1,80 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=248, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=128, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=128, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs16_eva_196.py b/configs/_base_/datasets/imagenet_bs16_eva_196.py new file mode 100644 index 0000000000000000000000000000000000000000..f668e1d6e56ab4c5e311af912fe4b560a3a12bfd --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs16_eva_196.py @@ -0,0 +1,60 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=196, + 
backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=196, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=196), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs16_eva_336.py b/configs/_base_/datasets/imagenet_bs16_eva_336.py new file mode 100644 index 0000000000000000000000000000000000000000..e2c770af0f58a4db5d0435807f3cc9b499d01295 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs16_eva_336.py @@ -0,0 +1,60 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=336, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=336, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=336), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs16_eva_448.py b/configs/_base_/datasets/imagenet_bs16_eva_448.py new file mode 100644 index 0000000000000000000000000000000000000000..b90bba14eefb3c7e0bac8234dd84461a7b420462 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs16_eva_448.py @@ -0,0 +1,62 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=448, + backend='pillow', + interpolation='bicubic'), + 
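+    # the fine-tuning pipeline here uses only a random resized crop and a horizontal flip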
dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=448, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=448), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + ann_file='meta/train.txt', + data_prefix='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=8, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + ann_file='meta/val.txt', + data_prefix='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs16_eva_560.py b/configs/_base_/datasets/imagenet_bs16_eva_560.py new file mode 100644 index 0000000000000000000000000000000000000000..9e548cc2a8de33fcd8ec80a2652dabcb931519aa --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs16_eva_560.py @@ -0,0 +1,60 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=560, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=560, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=560), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs16_pil_bicubic_384.py b/configs/_base_/datasets/imagenet_bs16_pil_bicubic_384.py new file mode 100644 index 0000000000000000000000000000000000000000..8507af4dd0219d8aa6449b6b3d9a1f8d39f1bfce --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs16_pil_bicubic_384.py @@ -0,0 +1,53 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, 
direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=384, backend='pillow', interpolation='bicubic'), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs256_beitv2.py b/configs/_base_/datasets/imagenet_bs256_beitv2.py new file mode 100644 index 0000000000000000000000000000000000000000..9d420326f2cf3e26f1478d684a03e39c51799534 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs256_beitv2.py @@ -0,0 +1,47 @@ +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='TwoNormDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + second_mean=[127.5, 127.5, 127.5], + second_std=[127.5, 127.5, 127.5], + to_rgb=True) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.4, + hue=0.), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandomResizedCropAndInterpolationWithTwoPic', + size=224, + second_size=224, + interpolation='bicubic', + second_interpolation='bicubic', + scale=(0.2, 1.0)), + dict( + type='BEiTMaskGenerator', + input_size=(14, 14), + num_masking_patches=75, + max_num_patches=75, + min_num_patches=16), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=256, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + split='train', + pipeline=train_pipeline)) diff --git a/configs/_base_/datasets/imagenet_bs256_davit_224.py b/configs/_base_/datasets/imagenet_bs256_davit_224.py new file mode 100644 index 0000000000000000000000000000000000000000..3ea0a8382d8feaae6f39808b6b1193684294f918 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs256_davit_224.py @@ -0,0 +1,80 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + 
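+    # RandomErasing in 'rand' mode fills the erased region with per-channel noise based on the BGR mean/std above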
dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=236, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs256_itpn.py b/configs/_base_/datasets/imagenet_bs256_itpn.py new file mode 100644 index 0000000000000000000000000000000000000000..0b51c47272a99c4257a8c98dfe0b2bb8652e54a4 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs256_itpn.py @@ -0,0 +1,49 @@ +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='TwoNormDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # clip mean & std + second_mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + second_std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + to_rgb=True) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.4, + hue=0.), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandomResizedCropAndInterpolationWithTwoPic', + size=224, + second_size=224, + interpolation='bicubic', + second_interpolation='bicubic', + scale=(0.2, 1.0)), + dict( + type='BEiTMaskGenerator', + input_size=(14, 14), + num_masking_patches=75, + max_num_patches=75, + min_num_patches=16), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=256, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='meta/train.txt', + data_prefix=dict(img_path='train/'), + pipeline=train_pipeline)) diff --git a/configs/_base_/datasets/imagenet_bs256_levit_224.py b/configs/_base_/datasets/imagenet_bs256_levit_224.py new file mode 100644 index 0000000000000000000000000000000000000000..612db7d7f0777ba50c78c084be8db7ba57266942 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs256_levit_224.py @@ -0,0 +1,80 @@ +dataset_type = 'ImageNet' + +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( 
+ type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=256, + num_workers=4, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=256, + num_workers=4, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs256_rsb_a12.py b/configs/_base_/datasets/imagenet_bs256_rsb_a12.py new file mode 100644 index 0000000000000000000000000000000000000000..ab59d9e42fea20b316f306023c86c7b75acdb80f --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs256_rsb_a12.py @@ -0,0 +1,72 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=7, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=236, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=256, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=256, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs256_rsb_a3.py b/configs/_base_/datasets/imagenet_bs256_rsb_a3.py new file mode 100644 index 0000000000000000000000000000000000000000..02e34497d8ba68416cab4b08b8347a9781899a4f --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs256_rsb_a3.py @@ -0,0 +1,72 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 
57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=6, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=236, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=256, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=256, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs256_simmim_192.py b/configs/_base_/datasets/imagenet_bs256_simmim_192.py new file mode 100644 index 0000000000000000000000000000000000000000..45062e9c28bac95737e4783c80f353870343b6f2 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs256_simmim_192.py @@ -0,0 +1,33 @@ +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='SelfSupDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + to_rgb=True) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=192, crop_ratio_range=(0.67, 1.0)), + dict(type='RandomFlip', prob=0.5), + dict( + type='SimMIMMaskGenerator', + input_size=192, + mask_patch_size=32, + model_patch_size=4, + mask_ratio=0.6), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=256, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + split='train', + pipeline=train_pipeline)) diff --git a/configs/_base_/datasets/imagenet_bs256_swin_192.py b/configs/_base_/datasets/imagenet_bs256_swin_192.py new file mode 100644 index 0000000000000000000000000000000000000000..11c2cb2a82ec320f18b21c89e2bd455a51912c24 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs256_swin_192.py @@ -0,0 +1,81 @@ +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=192, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, 
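+        # the 'timm_increasing' preset follows the increasing-magnitude RandAugment policy list from the timm library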
+ magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=219, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=192), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=256, + num_workers=8, + collate_fn=dict(type='default_collate'), + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type=dataset_type, + data_root=data_root, + split='train', + pipeline=train_pipeline), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + collate_fn=dict(type='default_collate'), + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + split='val', + pipeline=test_pipeline), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs32.py b/configs/_base_/datasets/imagenet_bs32.py new file mode 100644 index 0000000000000000000000000000000000000000..a069bb9c3317079e2d7cdec8c8573ad0c7d42470 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs32.py @@ -0,0 +1,51 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=256, edge='short'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=32, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=32, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs32_byol.py b/configs/_base_/datasets/imagenet_bs32_byol.py new file mode 100644 index 0000000000000000000000000000000000000000..a7235b3be6fbfb79bcdc7179aef0bcd906475a68 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs32_byol.py @@ -0,0 +1,89 @@ +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='SelfSupDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + to_rgb=True) + +view_pipeline1 = [ + dict( + type='RandomResizedCrop', + scale=224, + interpolation='bicubic', + backend='pillow'), + dict(type='RandomFlip', 
prob=0.5), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.2, + hue=0.1) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=1.), + dict(type='Solarize', thr=128, prob=0.), +] +view_pipeline2 = [ + dict( + type='RandomResizedCrop', + scale=224, + interpolation='bicubic', + backend='pillow'), + dict(type='RandomFlip', prob=0.5), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.2, + hue=0.1) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=0.1), + dict(type='Solarize', thr=128, prob=0.2) +] +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='MultiView', + num_views=[1, 1], + transforms=[view_pipeline1, view_pipeline2]), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=32, + num_workers=4, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + split='train', + pipeline=train_pipeline)) diff --git a/configs/_base_/datasets/imagenet_bs32_mocov2.py b/configs/_base_/datasets/imagenet_bs32_mocov2.py new file mode 100644 index 0000000000000000000000000000000000000000..dc60050dc748f3f28e0b68c83a1fd0910503039b --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs32_mocov2.py @@ -0,0 +1,58 @@ +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='SelfSupDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + to_rgb=True) + +# The difference between mocov2 and mocov1 is the transforms in the pipeline +view_pipeline = [ + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.2, 1.), + backend='pillow'), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.4, + hue=0.1) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=0.5), + dict(type='RandomFlip', prob=0.5), +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='MultiView', num_views=2, transforms=[view_pipeline]), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=32, + num_workers=8, + drop_last=True, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + split='train', + pipeline=train_pipeline)) diff --git a/configs/_base_/datasets/imagenet_bs32_pil_bicubic.py b/configs/_base_/datasets/imagenet_bs32_pil_bicubic.py new file mode 100644 index 0000000000000000000000000000000000000000..36880ff76abd2329199801f807ec3bb0469ec140 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs32_pil_bicubic.py @@ -0,0 +1,60 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 
57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=32, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=32, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs32_pil_resize.py b/configs/_base_/datasets/imagenet_bs32_pil_resize.py new file mode 100644 index 0000000000000000000000000000000000000000..f9afc5cb0ed9fa7941b17fdfdae792b54adc9608 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs32_pil_resize.py @@ -0,0 +1,51 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=32, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=32, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs32_simclr.py b/configs/_base_/datasets/imagenet_bs32_simclr.py new file mode 100644 index 0000000000000000000000000000000000000000..8e487b00b164eb964cfb4159a6918eb55d2b404e --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs32_simclr.py @@ -0,0 +1,52 @@ +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='SelfSupDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + to_rgb=True) + +view_pipeline = [ + dict(type='RandomResizedCrop', scale=224, backend='pillow'), + dict(type='RandomFlip', prob=0.5), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.8, + 
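+                # SimCLR applies stronger color jitter (0.8 brightness/contrast/saturation, 0.2 hue)
+                # than the milder 0.4 jitter used in the BYOL and MoCo v2 pipelines above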
contrast=0.8, + saturation=0.8, + hue=0.2) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=0.5), +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='MultiView', num_views=2, transforms=[view_pipeline]), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=32, + num_workers=4, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + split='train', + pipeline=train_pipeline)) diff --git a/configs/_base_/datasets/imagenet_bs512_mae.py b/configs/_base_/datasets/imagenet_bs512_mae.py new file mode 100644 index 0000000000000000000000000000000000000000..03d350eb0024a872e53f7d95ab7f3f12c4e70a25 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs512_mae.py @@ -0,0 +1,32 @@ +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='SelfSupDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + to_rgb=True) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.2, 1.0), + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=512, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + split='train', + pipeline=train_pipeline)) diff --git a/configs/_base_/datasets/imagenet_bs512_mocov3.py b/configs/_base_/datasets/imagenet_bs512_mocov3.py new file mode 100644 index 0000000000000000000000000000000000000000..1679f636e316a229744d8d79b8cda5c92e2b1450 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs512_mocov3.py @@ -0,0 +1,90 @@ +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='SelfSupDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + to_rgb=True) + +view_pipeline1 = [ + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.2, 1.), + backend='pillow'), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.2, + hue=0.1) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=1.), + dict(type='Solarize', thr=128, prob=0.), + dict(type='RandomFlip', prob=0.5), +] +view_pipeline2 = [ + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.2, 1.), + backend='pillow'), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.2, + hue=0.1) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=0.1), + dict(type='Solarize', thr=128, prob=0.2), + dict(type='RandomFlip', prob=0.5), +] +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='MultiView', + num_views=[1, 1], + 
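+        # one crop per pipeline; the two view pipelines differ only in their blur and solarize probabilities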
transforms=[view_pipeline1, view_pipeline2]), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=512, + num_workers=8, + persistent_workers=True, + pin_memory=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + split='train', + pipeline=train_pipeline)) diff --git a/configs/_base_/datasets/imagenet_bs64.py b/configs/_base_/datasets/imagenet_bs64.py new file mode 100644 index 0000000000000000000000000000000000000000..73e6d54bdde5523604dca93a8731765b4def92db --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64.py @@ -0,0 +1,51 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=256, edge='short'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_autoaug.py b/configs/_base_/datasets/imagenet_bs64_autoaug.py new file mode 100644 index 0000000000000000000000000000000000000000..3160b8cf2afaa05cd49e09cabade7f4716bbd23d --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_autoaug.py @@ -0,0 +1,59 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='AutoAugment', + policies='imagenet', + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=256, edge='short'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) 
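+# A minimal usage sketch (assumed paths, not part of this diff): a model config
+# typically inherits a dataset file like this one through `_base_`, e.g.
+#   _base_ = [
+#       '../_base_/datasets/imagenet_bs64_autoaug.py',
+#       '../_base_/schedules/imagenet_bs256.py',
+#       '../_base_/default_runtime.py',
+#   ]
+# and may then override individual fields such as train_dataloader.batch_size.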
+val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_clip_224.py b/configs/_base_/datasets/imagenet_bs64_clip_224.py new file mode 100644 index 0000000000000000000000000000000000000000..c200601ba45e7a1f317803e7c6f8c0ba34355623 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_clip_224.py @@ -0,0 +1,73 @@ +# dataset settings +dataset_type = 'ImageNet' +img_norm_cfg = dict( + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + to_rgb=True) +image_size = 224 +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + size=image_size, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'), + # dict( + # type='RandAugment', + # policies={{_base_.rand_increasing_policies}}, + # num_policies=2, + # total_level=10, + # magnitude_level=9, + # magnitude_std=0.5, + # hparams=dict( + # pad_val=[round(x) for x in img_norm_cfg['mean'][::-1]], + # interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=img_norm_cfg['mean'][::-1], + fill_std=img_norm_cfg['std'][::-1]), + dict(type='Normalize', **img_norm_cfg), + dict(type='ImageToTensor', keys=['img']), + dict(type='ToTensor', keys=['gt_label']), + dict(type='Collect', keys=['img', 'gt_label']) +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + size=(image_size, -1), + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=image_size), + dict(type='Normalize', **img_norm_cfg), + dict(type='ImageToTensor', keys=['img']), + dict(type='Collect', keys=['img']) +] + +data = dict( + samples_per_gpu=64, + workers_per_gpu=8, + train=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + val=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + test=dict( + # replace `data/val` with `data/test` for standard test + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline)) + +evaluation = dict(interval=10, metric='accuracy') diff --git a/configs/_base_/datasets/imagenet_bs64_clip_384.py b/configs/_base_/datasets/imagenet_bs64_clip_384.py new file mode 100644 index 0000000000000000000000000000000000000000..a7caee678774a3baa1481163fe89fe35ee5e9b96 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_clip_384.py @@ -0,0 +1,73 @@ +# dataset settings +dataset_type = 'ImageNet' +img_norm_cfg = dict( + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + to_rgb=True) +image_size = 384 +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + size=image_size, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'), + # dict( + # type='RandAugment', + # policies={{_base_.rand_increasing_policies}}, + # num_policies=2, + # total_level=10, + # magnitude_level=9, + # magnitude_std=0.5, + # hparams=dict( + # pad_val=[round(x) for x in img_norm_cfg['mean'][::-1]], + # interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + 
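+        # the imagenet_bs64_clip_* files keep an older config style (Normalize / ImageToTensor /
+        # Collect transforms and a single `data` dict) rather than the dataloader format used above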
max_area_ratio=1 / 3, + fill_color=img_norm_cfg['mean'][::-1], + fill_std=img_norm_cfg['std'][::-1]), + dict(type='Normalize', **img_norm_cfg), + dict(type='ImageToTensor', keys=['img']), + dict(type='ToTensor', keys=['gt_label']), + dict(type='Collect', keys=['img', 'gt_label']) +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + size=(image_size, -1), + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=image_size), + dict(type='Normalize', **img_norm_cfg), + dict(type='ImageToTensor', keys=['img']), + dict(type='Collect', keys=['img']) +] + +data = dict( + samples_per_gpu=64, + workers_per_gpu=8, + train=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + val=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + test=dict( + # replace `data/val` with `data/test` for standard test + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline)) + +evaluation = dict(interval=10, metric='accuracy') diff --git a/configs/_base_/datasets/imagenet_bs64_clip_448.py b/configs/_base_/datasets/imagenet_bs64_clip_448.py new file mode 100644 index 0000000000000000000000000000000000000000..32a92ef66a30d6caff7d399fb321ec9283965920 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_clip_448.py @@ -0,0 +1,74 @@ +# dataset settings +dataset_type = 'ImageNet' +img_norm_cfg = dict( + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + to_rgb=True) +image_size = 448 + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + size=image_size, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'), + # dict( + # type='RandAugment', + # policies={{_base_.rand_increasing_policies}}, + # num_policies=2, + # total_level=10, + # magnitude_level=9, + # magnitude_std=0.5, + # hparams=dict( + # pad_val=[round(x) for x in img_norm_cfg['mean'][::-1]], + # interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=img_norm_cfg['mean'][::-1], + fill_std=img_norm_cfg['std'][::-1]), + dict(type='Normalize', **img_norm_cfg), + dict(type='ImageToTensor', keys=['img']), + dict(type='ToTensor', keys=['gt_label']), + dict(type='Collect', keys=['img', 'gt_label']) +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + size=(image_size, -1), + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=image_size), + dict(type='Normalize', **img_norm_cfg), + dict(type='ImageToTensor', keys=['img']), + dict(type='Collect', keys=['img']) +] + +data = dict( + samples_per_gpu=64, + workers_per_gpu=8, + train=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + val=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + test=dict( + # replace `data/val` with `data/test` for standard test + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline)) + +evaluation = dict(interval=10, metric='accuracy') diff --git a/configs/_base_/datasets/imagenet_bs64_convmixer_224.py b/configs/_base_/datasets/imagenet_bs64_convmixer_224.py new file mode 100644 index 
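# A sketch of how one of the CLIP dataset files above might be reused. Unlike most
# of the other dataset files in this patch, the CLIP ones keep the older single
# `data` dict layout, so a derived config overrides that dict directly; the file
# name and local paths here are hypothetical, only the `_base_` inheritance
# mechanism is taken from the OpenMMLab config system.
_base_ = ['../_base_/datasets/imagenet_bs64_clip_224.py']

# base dicts are merged key by key, so only the overridden fields change
data = dict(
    samples_per_gpu=32,  # halve the per-GPU batch size
    train=dict(data_root='/data/imagenet'),  # hypothetical local ImageNet path
)
evaluation = dict(interval=1, metric='accuracy')  # evaluate every epoch instead of every 10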
0000000000000000000000000000000000000000..7e9c0aa0f9bfc8883f3ee5d58464c8ea97f5e3bc --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_convmixer_224.py @@ -0,0 +1,80 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs') +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=233, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_deit3_224.py b/configs/_base_/datasets/imagenet_bs64_deit3_224.py new file mode 100644 index 0000000000000000000000000000000000000000..5e460a4d95a21d2ca3c3d6bb0d65e5c5409c14ff --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_deit3_224.py @@ -0,0 +1,80 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=224, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + 
batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_deit3_384.py b/configs/_base_/datasets/imagenet_bs64_deit3_384.py new file mode 100644 index 0000000000000000000000000000000000000000..bc554ddba1d6a32a83638e7c2d58d27c345a4909 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_deit3_384.py @@ -0,0 +1,60 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=384, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=384), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_edgenext_256.py b/configs/_base_/datasets/imagenet_bs64_edgenext_256.py new file mode 100644 index 0000000000000000000000000000000000000000..7db9e4ef5f26691e364d244df0729827bf356293 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_edgenext_256.py @@ -0,0 +1,80 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=256, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + 
fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=292, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=256), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_hivit_224.py b/configs/_base_/datasets/imagenet_bs64_hivit_224.py new file mode 100644 index 0000000000000000000000000000000000000000..4c258d7ab50ac74c3b2bb30a852f8f38a0f10b83 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_hivit_224.py @@ -0,0 +1,83 @@ +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='meta/train.txt', + data_prefix='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='meta/val.txt', + data_prefix='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_mixer_224.py b/configs/_base_/datasets/imagenet_bs64_mixer_224.py new file mode 100644 index 0000000000000000000000000000000000000000..b92a5141b5d3c0784216c83effb7b171c631fccc --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_mixer_224.py @@ -0,0 +1,52 @@ +# dataset settings +dataset_type = 'ImageNet' + +# Google research 
usually use the below normalization setting. +data_preprocessor = dict( + num_classes=1000, + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=256, edge='short', interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_pil_resize.py b/configs/_base_/datasets/imagenet_bs64_pil_resize.py new file mode 100644 index 0000000000000000000000000000000000000000..79f9325b022ac8b9219134a3b1ef47b584fcf3b2 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_pil_resize.py @@ -0,0 +1,51 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_pil_resize_autoaug.py b/configs/_base_/datasets/imagenet_bs64_pil_resize_autoaug.py new file mode 100644 index 0000000000000000000000000000000000000000..c25906716c651d63440e1adeed66303ad7dae233 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_pil_resize_autoaug.py @@ -0,0 +1,68 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + 
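# Why the two reversals above: `data_preprocessor` lists its statistics in RGB
# order, while the pipeline transforms below still see BGR images from the default
# cv2 loader (the BGR->RGB conversion happens later, inside the data preprocessor),
# so the pad/fill values handed to the augmentations are flipped with [::-1]:
#   bgr_mean == [103.53, 116.28, 123.675]
#   bgr_std  == [57.375, 57.12, 58.395]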
+train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='AutoAugment', + policies='imagenet', + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_swin_224.py b/configs/_base_/datasets/imagenet_bs64_swin_224.py new file mode 100644 index 0000000000000000000000000000000000000000..6e8786eb0feb5cade66d01b6ce99b4240e11918b --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_swin_224.py @@ -0,0 +1,80 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_swin_256.py b/configs/_base_/datasets/imagenet_bs64_swin_256.py new file mode 100644 index 
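# A sketch of how the Swin 224 dataset file above is typically consumed: a full
# config lists it together with a model, a schedule and the shared runtime file via
# `_base_`, then overrides only what it needs. The model and schedule paths follow
# the usual mmpretrain layout but are assumptions here, not files added by this patch.
_base_ = [
    '../_base_/models/swin_transformer/base_224.py',
    '../_base_/datasets/imagenet_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

# inherited dicts are merged recursively, so partial overrides are enough
train_dataloader = dict(batch_size=128)            # larger per-GPU batch
default_hooks = dict(checkpoint=dict(interval=5))  # save a checkpoint every 5 epochs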
0000000000000000000000000000000000000000..9ecb41ba4d69c25ddc70469de440a0fde681fbc7 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_swin_256.py @@ -0,0 +1,80 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=256, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=292, # ( 256 / 224 * 256 ) + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=256), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_swin_384.py b/configs/_base_/datasets/imagenet_bs64_swin_384.py new file mode 100644 index 0000000000000000000000000000000000000000..11264f808c1d154c80f5609fbe25e1e7e69a5c88 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_swin_384.py @@ -0,0 +1,54 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=384, backend='pillow', interpolation='bicubic'), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please 
manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs64_t2t_224.py b/configs/_base_/datasets/imagenet_bs64_t2t_224.py new file mode 100644 index 0000000000000000000000000000000000000000..8a2dc10f85647fd20afd26d07a2c87a3e3a36962 --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs64_t2t_224.py @@ -0,0 +1,80 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=248, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/imagenet_bs8_pil_bicubic_320.py b/configs/_base_/datasets/imagenet_bs8_pil_bicubic_320.py new file mode 100644 index 0000000000000000000000000000000000000000..7160084e56b44205d92a8266fc78ff51bf2a7b4c --- /dev/null +++ b/configs/_base_/datasets/imagenet_bs8_pil_bicubic_320.py @@ -0,0 +1,59 @@ +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + # RGB format normalization parameters + mean=[122.5, 122.5, 122.5], + std=[122.5, 122.5, 122.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=320, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=int(320 / 224 * 256), + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=320), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=8, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( 
+ batch_size=8, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/inshop_bs32_448.py b/configs/_base_/datasets/inshop_bs32_448.py new file mode 100644 index 0000000000000000000000000000000000000000..f9772fa665d4a5a3abae575a8fc61fb9f360cd0e --- /dev/null +++ b/configs/_base_/datasets/inshop_bs32_448.py @@ -0,0 +1,64 @@ +# dataset settings +dataset_type = 'InShop' +data_preprocessor = dict( + num_classes=3997, + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=512), + dict(type='RandomCrop', crop_size=448), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=512), + dict(type='CenterCrop', crop_size=448), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=32, + num_workers=4, + dataset=dict( + type=dataset_type, + data_root='data/inshop', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +query_dataloader = dict( + batch_size=32, + num_workers=4, + dataset=dict( + type=dataset_type, + data_root='data/inshop', + split='query', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) + +gallery_dataloader = dict( + batch_size=32, + num_workers=4, + dataset=dict( + type=dataset_type, + data_root='data/inshop', + split='gallery', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_dataloader = query_dataloader +val_evaluator = [ + dict(type='RetrievalRecall', topk=1), + dict(type='RetrievalAveragePrecision', topk=10), +] + +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/nlvr2.py b/configs/_base_/datasets/nlvr2.py new file mode 100644 index 0000000000000000000000000000000000000000..2f5314bcd14d9e4f79898411e9c687470e31ac02 --- /dev/null +++ b/configs/_base_/datasets/nlvr2.py @@ -0,0 +1,86 @@ +# dataset settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +train_pipeline = [ + dict( + type='ApplyToList', + # NLVR requires to load two images in task. + scatter_key='img_path', + transforms=[ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + interpolation='bicubic', + backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + ], + collate_keys=['img', 'scale_factor', 'ori_shape'], + ), + dict(type='CleanCaption', keys='text'), + dict( + type='PackInputs', + algorithm_keys=['text'], + meta_keys=['image_id'], + ), +] + +test_pipeline = [ + dict( + type='ApplyToList', + # NLVR requires to load two images in task. 
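        # `scatter_key` below names the field holding the two image paths of an
        # NLVR2 example; ApplyToList runs the wrapped transforms on each path in
        # turn and then stacks the results for the keys listed in `collate_keys`,
        # so a single data sample ends up carrying both images (described loosely;
        # the transform itself defines the exact behaviour).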
+ scatter_key='img_path', + transforms=[ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(384, 384), + interpolation='bicubic', + backend='pillow'), + ], + collate_keys=['img', 'scale_factor', 'ori_shape'], + ), + dict( + type='PackInputs', + algorithm_keys=['text'], + meta_keys=['image_id'], + ), +] + +train_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='NLVR2', + data_root='data/nlvr2', + ann_file='dev.json', + data_prefix='dev', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + persistent_workers=True, + drop_last=True, +) + +val_dataloader = dict( + batch_size=64, + num_workers=8, + dataset=dict( + type='NLVR2', + data_root='data/nlvr2', + ann_file='dev.json', + data_prefix='dev', + pipeline=test_pipeline, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +val_evaluator = dict(type='Accuracy') + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/nocaps.py b/configs/_base_/datasets/nocaps.py new file mode 100644 index 0000000000000000000000000000000000000000..5176671f2b9335b12127c7b58b2626eec12476ea --- /dev/null +++ b/configs/_base_/datasets/nocaps.py @@ -0,0 +1,41 @@ +# data settings + +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(384, 384), + interpolation='bicubic', + backend='pillow'), + dict(type='PackInputs', meta_keys=['image_id']), +] + +val_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type='NoCaps', + data_root='data/nocaps/', + data_prefix=dict(img_path='images/'), + ann_file='annotations/nocaps_val_4500_captions.json', + pipeline=test_pipeline, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) + +val_evaluator = dict( + type='NocapsSave', + save_dir='./', +) + +# # If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/ocrvqa.py b/configs/_base_/datasets/ocrvqa.py new file mode 100644 index 0000000000000000000000000000000000000000..09e6e3536141f8ea901d2e5bb3070c23d816e8bc --- /dev/null +++ b/configs/_base_/datasets/ocrvqa.py @@ -0,0 +1,81 @@ +# data settings + +data_preprocessor = dict( + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + interpolation='bicubic', + backend='pillow'), + dict(type='CleanCaption', keys=['question', 'gt_answer']), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=[], + ), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(480, 480), + interpolation='bicubic', + backend='pillow'), + dict(type='CleanCaption', keys=['question', 'gt_answer']), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=[], + ), +] + +train_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='OCRVQA', + data_root='data/ocrvqa', + data_prefix='images', + ann_file='annotations/dataset.json', + split='train', + 
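        # the val/test loaders further down reuse the same annotations/dataset.json;
        # the `split` value above decides which subset of that file each loader reads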
pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + persistent_workers=True, + drop_last=True, +) + +val_dataloader = dict( + batch_size=64, + num_workers=8, + dataset=dict( + type='OCRVQA', + data_root='data/ocrvqa', + data_prefix='images', + ann_file='annotations/dataset.json', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +val_evaluator = dict(type='VQAAcc') + +test_dataloader = dict( + batch_size=64, + num_workers=8, + dataset=dict( + type='OCRVQA', + data_root='data/ocrvqa', + data_prefix='images', + ann_file='annotations/dataset.json', + split='test', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = dict(type='VQAAcc') diff --git a/configs/_base_/datasets/pipelines/auto_aug.py b/configs/_base_/datasets/pipelines/auto_aug.py new file mode 100644 index 0000000000000000000000000000000000000000..5a10f7eec61ea40336698118342939470f73d052 --- /dev/null +++ b/configs/_base_/datasets/pipelines/auto_aug.py @@ -0,0 +1,96 @@ +# Policy for ImageNet, refers to +# https://github.com/DeepVoltaire/AutoAugment/blame/master/autoaugment.py +policy_imagenet = [ + [ + dict(type='Posterize', bits=4, prob=0.4), + dict(type='Rotate', angle=30., prob=0.6) + ], + [ + dict(type='Solarize', thr=256 / 9 * 4, prob=0.6), + dict(type='AutoContrast', prob=0.6) + ], + [dict(type='Equalize', prob=0.8), + dict(type='Equalize', prob=0.6)], + [ + dict(type='Posterize', bits=5, prob=0.6), + dict(type='Posterize', bits=5, prob=0.6) + ], + [ + dict(type='Equalize', prob=0.4), + dict(type='Solarize', thr=256 / 9 * 5, prob=0.2) + ], + [ + dict(type='Equalize', prob=0.4), + dict(type='Rotate', angle=30 / 9 * 8, prob=0.8) + ], + [ + dict(type='Solarize', thr=256 / 9 * 6, prob=0.6), + dict(type='Equalize', prob=0.6) + ], + [dict(type='Posterize', bits=6, prob=0.8), + dict(type='Equalize', prob=1.)], + [ + dict(type='Rotate', angle=10., prob=0.2), + dict(type='Solarize', thr=256 / 9, prob=0.6) + ], + [ + dict(type='Equalize', prob=0.6), + dict(type='Posterize', bits=5, prob=0.4) + ], + [ + dict(type='Rotate', angle=30 / 9 * 8, prob=0.8), + dict(type='ColorTransform', magnitude=0., prob=0.4) + ], + [ + dict(type='Rotate', angle=30., prob=0.4), + dict(type='Equalize', prob=0.6) + ], + [dict(type='Equalize', prob=0.0), + dict(type='Equalize', prob=0.8)], + [dict(type='Invert', prob=0.6), + dict(type='Equalize', prob=1.)], + [ + dict(type='ColorTransform', magnitude=0.4, prob=0.6), + dict(type='Contrast', magnitude=0.8, prob=1.) + ], + [ + dict(type='Rotate', angle=30 / 9 * 8, prob=0.8), + dict(type='ColorTransform', magnitude=0.2, prob=1.) + ], + [ + dict(type='ColorTransform', magnitude=0.8, prob=0.8), + dict(type='Solarize', thr=256 / 9 * 2, prob=0.8) + ], + [ + dict(type='Sharpness', magnitude=0.7, prob=0.4), + dict(type='Invert', prob=0.6) + ], + [ + dict( + type='Shear', + magnitude=0.3 / 9 * 5, + prob=0.6, + direction='horizontal'), + dict(type='Equalize', prob=1.) + ], + [ + dict(type='ColorTransform', magnitude=0., prob=0.4), + dict(type='Equalize', prob=0.6) + ], + [ + dict(type='Equalize', prob=0.4), + dict(type='Solarize', thr=256 / 9 * 5, prob=0.2) + ], + [ + dict(type='Solarize', thr=256 / 9 * 4, prob=0.6), + dict(type='AutoContrast', prob=0.6) + ], + [dict(type='Invert', prob=0.6), + dict(type='Equalize', prob=1.)], + [ + dict(type='ColorTransform', magnitude=0.4, prob=0.6), + dict(type='Contrast', magnitude=0.8, prob=1.) 
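    # Each two-element list in `policy_imagenet` is one AutoAugment sub-policy: at
    # run time a single sub-policy is drawn per image and its two operations are
    # applied in order, each with its own probability (stated loosely; the exact
    # sampling is implemented by the AutoAugment transform).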
+ ], + [dict(type='Equalize', prob=0.8), + dict(type='Equalize', prob=0.6)], +] diff --git a/configs/_base_/datasets/pipelines/rand_aug.py b/configs/_base_/datasets/pipelines/rand_aug.py new file mode 100644 index 0000000000000000000000000000000000000000..f2bab3c364f0d0223f2c972673da3abb6ac21bc6 --- /dev/null +++ b/configs/_base_/datasets/pipelines/rand_aug.py @@ -0,0 +1,43 @@ +# Refers to `_RAND_INCREASING_TRANSFORMS` in pytorch-image-models +rand_increasing_policies = [ + dict(type='AutoContrast'), + dict(type='Equalize'), + dict(type='Invert'), + dict(type='Rotate', magnitude_key='angle', magnitude_range=(0, 30)), + dict(type='Posterize', magnitude_key='bits', magnitude_range=(4, 0)), + dict(type='Solarize', magnitude_key='thr', magnitude_range=(256, 0)), + dict( + type='SolarizeAdd', + magnitude_key='magnitude', + magnitude_range=(0, 110)), + dict( + type='ColorTransform', + magnitude_key='magnitude', + magnitude_range=(0, 0.9)), + dict(type='Contrast', magnitude_key='magnitude', magnitude_range=(0, 0.9)), + dict( + type='Brightness', magnitude_key='magnitude', + magnitude_range=(0, 0.9)), + dict( + type='Sharpness', magnitude_key='magnitude', magnitude_range=(0, 0.9)), + dict( + type='Shear', + magnitude_key='magnitude', + magnitude_range=(0, 0.3), + direction='horizontal'), + dict( + type='Shear', + magnitude_key='magnitude', + magnitude_range=(0, 0.3), + direction='vertical'), + dict( + type='Translate', + magnitude_key='magnitude', + magnitude_range=(0, 0.45), + direction='horizontal'), + dict( + type='Translate', + magnitude_key='magnitude', + magnitude_range=(0, 0.45), + direction='vertical') +] diff --git a/configs/_base_/datasets/refcoco.py b/configs/_base_/datasets/refcoco.py new file mode 100644 index 0000000000000000000000000000000000000000..f698e76c032fb22cc739450cc1e81e3174fd2b2f --- /dev/null +++ b/configs/_base_/datasets/refcoco.py @@ -0,0 +1,105 @@ +# data settings + +data_preprocessor = dict( + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.4, + hue=0.1, + backend='cv2') + ], + prob=0.5), + dict( + type='mmdet.RandomCrop', + crop_type='relative_range', + crop_size=(0.8, 0.8), + allow_negative_crop=False), + dict( + type='RandomChoiceResize', + scales=[(384, 384), (360, 360), (344, 344), (312, 312), (300, 300), + (286, 286), (270, 270)], + keep_ratio=False), + dict( + type='RandomTranslatePad', + size=384, + aug_translate=True, + ), + dict(type='CleanCaption', keys='text'), + dict( + type='PackInputs', + algorithm_keys=['text', 'gt_bboxes', 'scale_factor'], + meta_keys=['image_id'], + ), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(384, 384), + interpolation='bicubic', + backend='pillow'), + dict(type='CleanCaption', keys='text'), + dict( + type='PackInputs', + algorithm_keys=['text', 'gt_bboxes', 'scale_factor'], + meta_keys=['image_id'], + ), +] + +train_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='RefCOCO', + data_root='data/coco', + data_prefix='train2014', + ann_file='refcoco/instances.json', + split_file='refcoco/refs(unc).p', + split='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + drop_last=True, +) + +val_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='RefCOCO', + 
data_root='data/coco', + data_prefix='train2014', + ann_file='refcoco/instances.json', + split_file='refcoco/refs(unc).p', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) + +val_evaluator = dict(type='VisualGroundingMetric') + +test_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='RefCOCO', + data_root='data/coco', + data_prefix='train2014', + ann_file='refcoco/instances.json', + split_file='refcoco/refs(unc).p', + split='testA', # or 'testB' + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/vizwiz.py b/configs/_base_/datasets/vizwiz.py new file mode 100644 index 0000000000000000000000000000000000000000..bb7156c07030e9c031c8796c62267b7c4a8b2d7a --- /dev/null +++ b/configs/_base_/datasets/vizwiz.py @@ -0,0 +1,80 @@ +# data settings + +data_preprocessor = dict( + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + interpolation='bicubic', + backend='pillow'), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=['question_id', 'image_id'], + ), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(480, 480), + interpolation='bicubic', + backend='pillow'), + dict( + type='CleanCaption', + keys=['question'], + ), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=['question_id', 'image_id'], + ), +] + +train_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='VizWiz', + data_root='data/vizwiz/Images', + data_prefix='', + ann_file='Annotations/train.json', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + persistent_workers=True, + drop_last=True, +) + +val_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='VizWiz', + data_root='data/vizwiz/Images', + data_prefix='', + ann_file='Annotations/val.json', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +val_evaluator = dict(type='VizWizAcc') + +test_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='VizWiz', + data_root='data/vizwiz/Images', + data_prefix='', + ann_file='Annotations/test.json', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = dict(type='ReportVQA', file_path='vqa_test.json') diff --git a/configs/_base_/datasets/voc_bs16.py b/configs/_base_/datasets/voc_bs16.py new file mode 100644 index 0000000000000000000000000000000000000000..cac2248cb6f0fc96a1e1407e06bba5fbc9e70a4b --- /dev/null +++ b/configs/_base_/datasets/voc_bs16.py @@ -0,0 +1,65 @@ +# dataset settings +dataset_type = 'VOC' +data_preprocessor = dict( + num_classes=20, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, + # generate onehot-format labels for multi-label classification. 
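    # (a VOC image can carry several of the 20 classes at once, so the label becomes
    # a 20-dim multi-hot vector, e.g. an image tagged person+dog gets 1s at those
    # two indices and 0s elsewhere; this is why `to_onehot` is switched on below
    # and what the VOC multi-label metrics further down operate on)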
+ to_onehot=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=256, edge='short'), + dict(type='CenterCrop', crop_size=224), + dict( + type='PackInputs', + # `gt_label_difficult` is needed for VOC evaluation + meta_keys=('sample_idx', 'img_path', 'ori_shape', 'img_shape', + 'scale_factor', 'flip', 'flip_direction', + 'gt_label_difficult')), +] + +train_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/VOC2007', + split='trainval', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=16, + num_workers=5, + dataset=dict( + type=dataset_type, + data_root='data/VOC2007', + split='test', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) + +test_dataloader = val_dataloader + +# calculate precision_recall_f1 and mAP +val_evaluator = [ + dict(type='VOCMultiLabelMetric'), + dict(type='VOCMultiLabelMetric', average='micro'), + dict(type='VOCAveragePrecision') +] + +test_dataloader = val_dataloader +test_evaluator = val_evaluator diff --git a/configs/_base_/datasets/vsr.py b/configs/_base_/datasets/vsr.py new file mode 100644 index 0000000000000000000000000000000000000000..0fa9b8992d0c453797b38add80dd6c92fbfa9227 --- /dev/null +++ b/configs/_base_/datasets/vsr.py @@ -0,0 +1,81 @@ +# data settings + +data_preprocessor = dict( + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + interpolation='bicubic', + backend='pillow'), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=['question_id', 'image_id'], + ), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(480, 480), + interpolation='bicubic', + backend='pillow'), + dict( + type='CleanCaption', + keys=['question'], + ), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=['question_id', 'image_id'], + ), +] + +train_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='VSR', + data_root='data/coco', + data_prefix='', + ann_file='annotations/train.json', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, + drop_last=True, +) + +val_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='VSR', + data_root='data/coco', + data_prefix='', + ann_file='annotations/val.json', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +val_evaluator = dict(type='VSRAcc') + +test_dataloader = dict( + batch_size=16, + num_workers=8, + dataset=dict( + type='VSR', + data_root='data/coco', + data_prefix='', + ann_file='annotations/test.json', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +test_evaluator = val_evaluator diff --git a/configs/_base_/default_runtime.py b/configs/_base_/default_runtime.py new file mode 100644 index 0000000000000000000000000000000000000000..3816d423fabab10d26b0abfea1f60eb270c1dc83 --- /dev/null +++ 
b/configs/_base_/default_runtime.py @@ -0,0 +1,51 @@ +# defaults to use registries in mmpretrain +default_scope = 'mmpretrain' + +# configure default hooks +default_hooks = dict( + # record the time of every iteration. + timer=dict(type='IterTimerHook'), + + # print log every 100 iterations. + logger=dict(type='LoggerHook', interval=100), + + # enable the parameter scheduler. + param_scheduler=dict(type='ParamSchedulerHook'), + + # save checkpoint per epoch. + checkpoint=dict(type='CheckpointHook', interval=1), + + # set sampler seed in distributed evrionment. + sampler_seed=dict(type='DistSamplerSeedHook'), + + # validation results visualization, set True to enable it. + visualization=dict(type='VisualizationHook', enable=False), +) + +# configure environment +env_cfg = dict( + # whether to enable cudnn benchmark + cudnn_benchmark=False, + + # set multi process parameters + mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0), + + # set distributed parameters + dist_cfg=dict(backend='nccl'), +) + +# set visualizer +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict(type='UniversalVisualizer', vis_backends=vis_backends) + +# set log level +log_level = 'INFO' + +# load from which checkpoint +load_from = None + +# whether to resume training from the loaded checkpoint +resume = False + +# Defaults to use random seed and disable `deterministic` +randomness = dict(seed=None, deterministic=False) diff --git a/configs/_base_/models/conformer/base-p16.py b/configs/_base_/models/conformer/base-p16.py new file mode 100644 index 0000000000000000000000000000000000000000..959da5059a8f36c1076bf9875c51fd466fc96fa4 --- /dev/null +++ b/configs/_base_/models/conformer/base-p16.py @@ -0,0 +1,23 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='Conformer', arch='base', drop_path_rate=0.1, init_cfg=None), + neck=None, + head=dict( + type='ConformerHead', + num_classes=1000, + in_channels=[1536, 576], + init_cfg=None, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/conformer/small-p16.py b/configs/_base_/models/conformer/small-p16.py new file mode 100644 index 0000000000000000000000000000000000000000..2e4f9f80745af51538306bd8928082f3fd2e9997 --- /dev/null +++ b/configs/_base_/models/conformer/small-p16.py @@ -0,0 +1,23 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='Conformer', arch='small', drop_path_rate=0.1, init_cfg=None), + neck=None, + head=dict( + type='ConformerHead', + num_classes=1000, + in_channels=[1024, 384], + init_cfg=None, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) 
+ ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/conformer/small-p32.py b/configs/_base_/models/conformer/small-p32.py new file mode 100644 index 0000000000000000000000000000000000000000..f73811fff492f3e1770e514335ccc71b2bd3caf6 --- /dev/null +++ b/configs/_base_/models/conformer/small-p32.py @@ -0,0 +1,27 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='Conformer', + arch='small', + patch_size=32, + drop_path_rate=0.1, + init_cfg=None), + neck=None, + head=dict( + type='ConformerHead', + num_classes=1000, + in_channels=[1024, 384], + init_cfg=None, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/conformer/tiny-p16.py b/configs/_base_/models/conformer/tiny-p16.py new file mode 100644 index 0000000000000000000000000000000000000000..fa9753b6fac957a0c8f9612bd0b9a693a3ecbf4e --- /dev/null +++ b/configs/_base_/models/conformer/tiny-p16.py @@ -0,0 +1,23 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='Conformer', arch='tiny', drop_path_rate=0.1, init_cfg=None), + neck=None, + head=dict( + type='ConformerHead', + num_classes=1000, + in_channels=[256, 384], + init_cfg=None, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) 
+ ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/convmixer/convmixer-1024-20.py b/configs/_base_/models/convmixer/convmixer-1024-20.py new file mode 100644 index 0000000000000000000000000000000000000000..a8f4d517e0d5e74c0d0412bb6e4f43b244761c03 --- /dev/null +++ b/configs/_base_/models/convmixer/convmixer-1024-20.py @@ -0,0 +1,11 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='ConvMixer', arch='1024/20'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/convmixer/convmixer-1536-20.py b/configs/_base_/models/convmixer/convmixer-1536-20.py new file mode 100644 index 0000000000000000000000000000000000000000..9ad8209bb4fc55665be36cdcd8102d854c533951 --- /dev/null +++ b/configs/_base_/models/convmixer/convmixer-1536-20.py @@ -0,0 +1,11 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='ConvMixer', arch='1536/20'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/convmixer/convmixer-768-32.py b/configs/_base_/models/convmixer/convmixer-768-32.py new file mode 100644 index 0000000000000000000000000000000000000000..1cba528b0edf9d394ae9730ecd51d41bbd314b38 --- /dev/null +++ b/configs/_base_/models/convmixer/convmixer-768-32.py @@ -0,0 +1,11 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='ConvMixer', arch='768/32', act_cfg=dict(type='ReLU')), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/convnext/convnext-base.py b/configs/_base_/models/convnext/convnext-base.py new file mode 100644 index 0000000000000000000000000000000000000000..aba6c19d1ac5039bab2363f80d500c81d4bb809b --- /dev/null +++ b/configs/_base_/models/convnext/convnext-base.py @@ -0,0 +1,19 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='ConvNeXt', arch='base', drop_path_rate=0.5), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) diff --git a/configs/_base_/models/convnext/convnext-large.py b/configs/_base_/models/convnext/convnext-large.py new file mode 100644 index 0000000000000000000000000000000000000000..9bd4d9f68bd47b207de129ab169c2366156199b3 --- /dev/null +++ b/configs/_base_/models/convnext/convnext-large.py @@ -0,0 +1,19 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='ConvNeXt', arch='large', drop_path_rate=0.5), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), 
+ ]), +) diff --git a/configs/_base_/models/convnext/convnext-small.py b/configs/_base_/models/convnext/convnext-small.py new file mode 100644 index 0000000000000000000000000000000000000000..aeedb6d22fc8f80fe6c5fb246df44c8a28c41854 --- /dev/null +++ b/configs/_base_/models/convnext/convnext-small.py @@ -0,0 +1,19 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='ConvNeXt', arch='small', drop_path_rate=0.4), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) diff --git a/configs/_base_/models/convnext/convnext-tiny.py b/configs/_base_/models/convnext/convnext-tiny.py new file mode 100644 index 0000000000000000000000000000000000000000..05baba09eefe44196a54c112c5c785ff79a1b52b --- /dev/null +++ b/configs/_base_/models/convnext/convnext-tiny.py @@ -0,0 +1,19 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='ConvNeXt', arch='tiny', drop_path_rate=0.1), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) diff --git a/configs/_base_/models/convnext/convnext-xlarge.py b/configs/_base_/models/convnext/convnext-xlarge.py new file mode 100644 index 0000000000000000000000000000000000000000..7211b94f6cebe4c93d150dec276291f725f9f513 --- /dev/null +++ b/configs/_base_/models/convnext/convnext-xlarge.py @@ -0,0 +1,19 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='ConvNeXt', arch='xlarge', drop_path_rate=0.5), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) diff --git a/configs/_base_/models/convnext_v2/atto.py b/configs/_base_/models/convnext_v2/atto.py new file mode 100644 index 0000000000000000000000000000000000000000..557ce93fce2572fe2fd95db80da4556e0dd7810d --- /dev/null +++ b/configs/_base_/models/convnext_v2/atto.py @@ -0,0 +1,20 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ConvNeXt', + arch='atto', + drop_path_rate=0.1, + layer_scale_init_value=0., + use_grn=True, + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=320, + loss=dict(type='LabelSmoothLoss', label_smooth_val=0.2), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), +) diff --git a/configs/_base_/models/convnext_v2/base.py b/configs/_base_/models/convnext_v2/base.py new file mode 100644 index 0000000000000000000000000000000000000000..1401ef75f96814d5db1f6a37aa8d8761ccfe1e39 --- /dev/null +++ b/configs/_base_/models/convnext_v2/base.py @@ -0,0 +1,24 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ConvNeXt', + arch='base', + drop_path_rate=0.1, 
+ layer_scale_init_value=0., + use_grn=True, + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) diff --git a/configs/_base_/models/convnext_v2/femto.py b/configs/_base_/models/convnext_v2/femto.py new file mode 100644 index 0000000000000000000000000000000000000000..d56a241a97820713618480bec0fe09f94ecb1cea --- /dev/null +++ b/configs/_base_/models/convnext_v2/femto.py @@ -0,0 +1,20 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ConvNeXt', + arch='femto', + drop_path_rate=0.1, + layer_scale_init_value=0., + use_grn=True, + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), +) diff --git a/configs/_base_/models/convnext_v2/huge.py b/configs/_base_/models/convnext_v2/huge.py new file mode 100644 index 0000000000000000000000000000000000000000..54141dd5220fdd0f40ce21054890e86b19597aff --- /dev/null +++ b/configs/_base_/models/convnext_v2/huge.py @@ -0,0 +1,24 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ConvNeXt', + arch='huge', + drop_path_rate=0.1, + layer_scale_init_value=0., + use_grn=True, + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2816, + loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) diff --git a/configs/_base_/models/convnext_v2/large.py b/configs/_base_/models/convnext_v2/large.py new file mode 100644 index 0000000000000000000000000000000000000000..20237de2baaccd2779bcec45549ec5a294d8ba6b --- /dev/null +++ b/configs/_base_/models/convnext_v2/large.py @@ -0,0 +1,24 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ConvNeXt', + arch='large', + drop_path_rate=0.1, + layer_scale_init_value=0., + use_grn=True, + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) diff --git a/configs/_base_/models/convnext_v2/nano.py b/configs/_base_/models/convnext_v2/nano.py new file mode 100644 index 0000000000000000000000000000000000000000..05575d0e105da6880beafa08d1bdb0c608261a51 --- /dev/null +++ b/configs/_base_/models/convnext_v2/nano.py @@ -0,0 +1,20 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ConvNeXt', + arch='nano', + drop_path_rate=0.1, + layer_scale_init_value=0., + use_grn=True, + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=640, + loss=dict(type='LabelSmoothLoss', label_smooth_val=0.2), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), +) diff --git a/configs/_base_/models/convnext_v2/pico.py 
b/configs/_base_/models/convnext_v2/pico.py new file mode 100644 index 0000000000000000000000000000000000000000..6d50ba890069457bc512ac2d2da1038ee73cd065 --- /dev/null +++ b/configs/_base_/models/convnext_v2/pico.py @@ -0,0 +1,20 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ConvNeXt', + arch='pico', + drop_path_rate=0.1, + layer_scale_init_value=0., + use_grn=True, + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), +) diff --git a/configs/_base_/models/convnext_v2/tiny.py b/configs/_base_/models/convnext_v2/tiny.py new file mode 100644 index 0000000000000000000000000000000000000000..c9835ccdb47f8c976be9519160ba13f6f4a168f9 --- /dev/null +++ b/configs/_base_/models/convnext_v2/tiny.py @@ -0,0 +1,24 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ConvNeXt', + arch='tiny', + drop_path_rate=0.2, + layer_scale_init_value=0., + use_grn=True, + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='LabelSmoothLoss', label_smooth_val=0.2), + init_cfg=None, + ), + init_cfg=dict( + type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) diff --git a/configs/_base_/models/davit/davit-base.py b/configs/_base_/models/davit/davit-base.py new file mode 100644 index 0000000000000000000000000000000000000000..0dbf07739ecc907e4a77d0cdbd9c21f4c8fbecf1 --- /dev/null +++ b/configs/_base_/models/davit/davit-base.py @@ -0,0 +1,16 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='DaViT', arch='base', out_indices=(3, ), drop_path_rate=0.4), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/davit/davit-small.py b/configs/_base_/models/davit/davit-small.py new file mode 100644 index 0000000000000000000000000000000000000000..2fa0325552c2bc28f69263ba42547090b7a521fb --- /dev/null +++ b/configs/_base_/models/davit/davit-small.py @@ -0,0 +1,16 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='DaViT', arch='small', out_indices=(3, ), drop_path_rate=0.2), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/davit/davit-tiny.py b/configs/_base_/models/davit/davit-tiny.py new file mode 100644 index 0000000000000000000000000000000000000000..29432d28bd09a613bf4eaabe4f8ef4d0d763a49d --- /dev/null +++ b/configs/_base_/models/davit/davit-tiny.py @@ -0,0 +1,16 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='DaViT', arch='t', out_indices=(3, ), drop_path_rate=0.1), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + 
train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/deit3/deit3-base-p16-224.py b/configs/_base_/models/deit3/deit3-base-p16-224.py new file mode 100644 index 0000000000000000000000000000000000000000..84cba1afadbf13ed78e5f3c2be112a70b5ba8be1 --- /dev/null +++ b/configs/_base_/models/deit3/deit3-base-p16-224.py @@ -0,0 +1,24 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='DeiT3', + arch='b', + img_size=224, + patch_size=16, + drop_path_rate=0.2), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/deit3/deit3-base-p16-384.py b/configs/_base_/models/deit3/deit3-base-p16-384.py new file mode 100644 index 0000000000000000000000000000000000000000..1c9f42bc3a3b69c5091c5a31c0d7a137fb944cf5 --- /dev/null +++ b/configs/_base_/models/deit3/deit3-base-p16-384.py @@ -0,0 +1,24 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='DeiT3', + arch='b', + img_size=384, + patch_size=16, + drop_path_rate=0.15), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/deit3/deit3-huge-p14-224.py b/configs/_base_/models/deit3/deit3-huge-p14-224.py new file mode 100644 index 0000000000000000000000000000000000000000..b7a69ce914fbc32b029cb1a891fb1cf49d4bfce0 --- /dev/null +++ b/configs/_base_/models/deit3/deit3-huge-p14-224.py @@ -0,0 +1,24 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='DeiT3', + arch='h', + img_size=224, + patch_size=14, + drop_path_rate=0.55), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=1280, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/deit3/deit3-large-p16-224.py b/configs/_base_/models/deit3/deit3-large-p16-224.py new file mode 100644 index 0000000000000000000000000000000000000000..96135c57879715a1de50efd8e6c28fc635eae1ff --- /dev/null +++ b/configs/_base_/models/deit3/deit3-large-p16-224.py @@ -0,0 +1,24 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='DeiT3', + arch='l', + img_size=224, + patch_size=16, + drop_path_rate=0.45), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=1024, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + 
dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/deit3/deit3-large-p16-384.py b/configs/_base_/models/deit3/deit3-large-p16-384.py new file mode 100644 index 0000000000000000000000000000000000000000..aa9326c17cd0b0e1d625270140a80f1bb92fc0bf --- /dev/null +++ b/configs/_base_/models/deit3/deit3-large-p16-384.py @@ -0,0 +1,24 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='DeiT3', + arch='l', + img_size=384, + patch_size=16, + drop_path_rate=0.4), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=1024, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/deit3/deit3-medium-p16-224.py b/configs/_base_/models/deit3/deit3-medium-p16-224.py new file mode 100644 index 0000000000000000000000000000000000000000..84233e5cfde13cd0f142b49f64c3b3ec65ff4f68 --- /dev/null +++ b/configs/_base_/models/deit3/deit3-medium-p16-224.py @@ -0,0 +1,24 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='DeiT3', + arch='m', + img_size=224, + patch_size=16, + drop_path_rate=0.2), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=512, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/deit3/deit3-small-p16-224.py b/configs/_base_/models/deit3/deit3-small-p16-224.py new file mode 100644 index 0000000000000000000000000000000000000000..af29d32bc799ebdff5a9724fe5555261ba0b584c --- /dev/null +++ b/configs/_base_/models/deit3/deit3-small-p16-224.py @@ -0,0 +1,24 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='DeiT3', + arch='s', + img_size=224, + patch_size=16, + drop_path_rate=0.05), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=384, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/deit3/deit3-small-p16-384.py b/configs/_base_/models/deit3/deit3-small-p16-384.py new file mode 100644 index 0000000000000000000000000000000000000000..bebb4845e8c3a47e1d944702c49357d6d8aa4cd6 --- /dev/null +++ b/configs/_base_/models/deit3/deit3-small-p16-384.py @@ -0,0 +1,24 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='DeiT3', + arch='s', + img_size=384, + patch_size=16, + drop_path_rate=0.0), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=384, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git 
a/configs/_base_/models/densenet/densenet121.py b/configs/_base_/models/densenet/densenet121.py new file mode 100644 index 0000000000000000000000000000000000000000..0a14d302584a910e87ccf598e9434bd0685207aa --- /dev/null +++ b/configs/_base_/models/densenet/densenet121.py @@ -0,0 +1,11 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='DenseNet', arch='121'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/densenet/densenet161.py b/configs/_base_/models/densenet/densenet161.py new file mode 100644 index 0000000000000000000000000000000000000000..61a0d838806267a5c987fa30eeb6363f23387ef3 --- /dev/null +++ b/configs/_base_/models/densenet/densenet161.py @@ -0,0 +1,11 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='DenseNet', arch='161'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2208, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/densenet/densenet169.py b/configs/_base_/models/densenet/densenet169.py new file mode 100644 index 0000000000000000000000000000000000000000..779ea1709256f8c001adaa3c73155c36d3363d71 --- /dev/null +++ b/configs/_base_/models/densenet/densenet169.py @@ -0,0 +1,11 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='DenseNet', arch='169'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1664, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/densenet/densenet201.py b/configs/_base_/models/densenet/densenet201.py new file mode 100644 index 0000000000000000000000000000000000000000..2909af0d36c656c1868ff38e72981dc9dafeaa2f --- /dev/null +++ b/configs/_base_/models/densenet/densenet201.py @@ -0,0 +1,11 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='DenseNet', arch='201'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1920, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/edgenext/edgenext-base.py b/configs/_base_/models/edgenext/edgenext-base.py new file mode 100644 index 0000000000000000000000000000000000000000..378397298ed9d51241ad737d65b05f151ac69393 --- /dev/null +++ b/configs/_base_/models/edgenext/edgenext-base.py @@ -0,0 +1,23 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='EdgeNeXt', + arch='base', + out_indices=(3, ), + drop_path_rate=0.1, + gap_before_final_norm=True, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.), + ]), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=584, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/edgenext/edgenext-small.py b/configs/_base_/models/edgenext/edgenext-small.py new file mode 100644 index 0000000000000000000000000000000000000000..e1f7e1728a2f5cb895600aa0d81eeb5734dffec0 --- /dev/null +++ b/configs/_base_/models/edgenext/edgenext-small.py @@ -0,0 +1,23 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='EdgeNeXt', + arch='small', + out_indices=(3, 
), + drop_path_rate=0.1, + gap_before_final_norm=True, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.), + ]), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=304, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/edgenext/edgenext-xsmall.py b/configs/_base_/models/edgenext/edgenext-xsmall.py new file mode 100644 index 0000000000000000000000000000000000000000..69c7d0d6a6ec9d09df03c007cd3fffa93165f5cb --- /dev/null +++ b/configs/_base_/models/edgenext/edgenext-xsmall.py @@ -0,0 +1,23 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='EdgeNeXt', + arch='xsmall', + out_indices=(3, ), + drop_path_rate=0.1, + gap_before_final_norm=True, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.), + ]), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=192, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/edgenext/edgenext-xxsmall.py b/configs/_base_/models/edgenext/edgenext-xxsmall.py new file mode 100644 index 0000000000000000000000000000000000000000..fb6881951fae8c01c2a4ea78c3d61e7c6a900f24 --- /dev/null +++ b/configs/_base_/models/edgenext/edgenext-xxsmall.py @@ -0,0 +1,23 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='EdgeNeXt', + arch='xxsmall', + out_indices=(3, ), + drop_path_rate=0.1, + gap_before_final_norm=True, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.), + ]), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=168, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/efficientformer-l1.py b/configs/_base_/models/efficientformer-l1.py new file mode 100644 index 0000000000000000000000000000000000000000..37dc62cd235ee5a3f0257a24c54c8eb4fc797159 --- /dev/null +++ b/configs/_base_/models/efficientformer-l1.py @@ -0,0 +1,18 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='EfficientFormer', + arch='l1', + drop_path_rate=0, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + dict(type='Constant', layer=['LayerScale'], val=1e-5) + ]), + neck=dict(type='GlobalAveragePooling', dim=1), + head=dict( + type='EfficientFormerClsHead', in_channels=448, num_classes=1000)) diff --git a/configs/_base_/models/efficientnet_b0.py b/configs/_base_/models/efficientnet_b0.py new file mode 100644 index 0000000000000000000000000000000000000000..d9ba685306c9e411a69887a2a301808cbaa104cb --- /dev/null +++ b/configs/_base_/models/efficientnet_b0.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNet', arch='b0'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_b1.py b/configs/_base_/models/efficientnet_b1.py new file mode 100644 index 0000000000000000000000000000000000000000..63e15c88b2f7e1d1c788811741ff26bf5f35601f --- /dev/null +++ 
b/configs/_base_/models/efficientnet_b1.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNet', arch='b1'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_b2.py b/configs/_base_/models/efficientnet_b2.py new file mode 100644 index 0000000000000000000000000000000000000000..5edcfa5d5b680ec41567e531e0b7a587e160c8af --- /dev/null +++ b/configs/_base_/models/efficientnet_b2.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNet', arch='b2'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1408, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_b3.py b/configs/_base_/models/efficientnet_b3.py new file mode 100644 index 0000000000000000000000000000000000000000..c7c6d6d899ecb910a37cbd3818f8c79c27db87e9 --- /dev/null +++ b/configs/_base_/models/efficientnet_b3.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNet', arch='b3'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_b4.py b/configs/_base_/models/efficientnet_b4.py new file mode 100644 index 0000000000000000000000000000000000000000..06840ed559cc14ae47919f7cce67d635173e841d --- /dev/null +++ b/configs/_base_/models/efficientnet_b4.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNet', arch='b4'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1792, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_b5.py b/configs/_base_/models/efficientnet_b5.py new file mode 100644 index 0000000000000000000000000000000000000000..a86eebd19042eb36534ef3f42cc16bb32e88fb66 --- /dev/null +++ b/configs/_base_/models/efficientnet_b5.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNet', arch='b5'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_b6.py b/configs/_base_/models/efficientnet_b6.py new file mode 100644 index 0000000000000000000000000000000000000000..4eada1d32511371bcb11c636b3aae9dc4733d379 --- /dev/null +++ b/configs/_base_/models/efficientnet_b6.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNet', arch='b6'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2304, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_b7.py b/configs/_base_/models/efficientnet_b7.py new file mode 100644 index 0000000000000000000000000000000000000000..1d84ba427f42a186f376d829189461536e7ee383 --- /dev/null +++ 
b/configs/_base_/models/efficientnet_b7.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNet', arch='b7'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2560, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_b8.py b/configs/_base_/models/efficientnet_b8.py new file mode 100644 index 0000000000000000000000000000000000000000..c9500644dae4a3240c5ecfa02f90deb8fde4e3de --- /dev/null +++ b/configs/_base_/models/efficientnet_b8.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNet', arch='b8'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2816, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_em.py b/configs/_base_/models/efficientnet_em.py new file mode 100644 index 0000000000000000000000000000000000000000..abecdbeef6c3791f902b6bd13fbceb28c3ac8942 --- /dev/null +++ b/configs/_base_/models/efficientnet_em.py @@ -0,0 +1,13 @@ +# model settings +model = dict( + type='ImageClassifier', + # `em` means EfficientNet-EdgeTPU-M arch + backbone=dict(type='EfficientNet', arch='em', act_cfg=dict(type='ReLU')), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_es.py b/configs/_base_/models/efficientnet_es.py new file mode 100644 index 0000000000000000000000000000000000000000..911ba4a18261decd3d17e8962501083e1f1ea550 --- /dev/null +++ b/configs/_base_/models/efficientnet_es.py @@ -0,0 +1,13 @@ +# model settings +model = dict( + type='ImageClassifier', + # `es` means EfficientNet-EdgeTPU-S arch + backbone=dict(type='EfficientNet', arch='es', act_cfg=dict(type='ReLU')), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_l2.py b/configs/_base_/models/efficientnet_l2.py new file mode 100644 index 0000000000000000000000000000000000000000..4219c87a81a93c50296cfebed8f20b9bbd2a4c13 --- /dev/null +++ b/configs/_base_/models/efficientnet_l2.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNet', arch='l2'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=5504, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_b0.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_b0.py new file mode 100644 index 0000000000000000000000000000000000000000..d42e32905ed9d18ab572bfe1e90c7161f941a34f --- /dev/null +++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_b0.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNetV2', arch='b0'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git 
a/configs/_base_/models/efficientnet_v2/efficientnetv2_b1.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_b1.py new file mode 100644 index 0000000000000000000000000000000000000000..10736fc504637b07fe362e27c5e86ea73990217a --- /dev/null +++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_b1.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNetV2', arch='b1'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_b2.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_b2.py new file mode 100644 index 0000000000000000000000000000000000000000..61f477120e031cd8cf46340bdbd3c687ade2a035 --- /dev/null +++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_b2.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNetV2', arch='b2'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1408, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_b3.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_b3.py new file mode 100644 index 0000000000000000000000000000000000000000..14e523fd2e4180e960aa8a3282e56f6604c38a47 --- /dev/null +++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_b3.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNetV2', arch='b3'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_l.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_l.py new file mode 100644 index 0000000000000000000000000000000000000000..456467d6fa076db11b009fca875e231569e05288 --- /dev/null +++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_l.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNetV2', arch='l'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_m.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_m.py new file mode 100644 index 0000000000000000000000000000000000000000..8e4d303f624d3375416b7c41c59a68a1a64e4a19 --- /dev/null +++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_m.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNetV2', arch='m'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_s.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_s.py new file mode 100644 index 0000000000000000000000000000000000000000..866648223c79aac1ca8519a1d18b167b7ac474ec --- /dev/null +++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_s.py @@ -0,0 +1,12 @@ +# model settings 
+model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNetV2', arch='s'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_xl.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_xl.py new file mode 100644 index 0000000000000000000000000000000000000000..2216c9daa7d5e5e11084320b3aeab6a388588f40 --- /dev/null +++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_xl.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='EfficientNetV2', arch='xl'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/eva/eva-g.py b/configs/_base_/models/eva/eva-g.py new file mode 100644 index 0000000000000000000000000000000000000000..17bc84ad8bd2ac5599f26351b5fb5ca3fb8ec8bc --- /dev/null +++ b/configs/_base_/models/eva/eva-g.py @@ -0,0 +1,29 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='BEiTViT', + arch='eva-g', + img_size=224, + patch_size=14, + layer_scale_init_value=0.0, + out_type='avg_featmap', + use_abs_pos_emb=True, + use_rel_pos_bias=False, + use_shared_rel_pos_bias=False, + ), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1408, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/eva/eva-l.py b/configs/_base_/models/eva/eva-l.py new file mode 100644 index 0000000000000000000000000000000000000000..9b08e4b1e1881b706848c121ceb3b4d23cfae34a --- /dev/null +++ b/configs/_base_/models/eva/eva-l.py @@ -0,0 +1,30 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='BEiTViT', + arch='l', + img_size=224, + patch_size=14, + layer_scale_init_value=0.0, + out_type='avg_featmap', + use_abs_pos_emb=True, + use_rel_pos_bias=False, + use_shared_rel_pos_bias=False, + layer_cfgs=dict(bias=True), + ), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/hivit/base_224.py b/configs/_base_/models/hivit/base_224.py new file mode 100644 index 0000000000000000000000000000000000000000..a87a68cf6f03e3e794361324fe5158b6a7dc5faa --- /dev/null +++ b/configs/_base_/models/hivit/base_224.py @@ -0,0 +1,28 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='HiViT', + arch='base', + img_size=224, + ape=True, + rpe=True, + drop_path_rate=0.5), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. 
+ loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/hivit/small_224.py b/configs/_base_/models/hivit/small_224.py new file mode 100644 index 0000000000000000000000000000000000000000..333b2461d3ef681dd24f367f18e38f2cc87dd2de --- /dev/null +++ b/configs/_base_/models/hivit/small_224.py @@ -0,0 +1,28 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='HiViT', + arch='small', + img_size=224, + ape=True, + rpe=True, + drop_path_rate=0.3), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/hivit/tiny_224.py b/configs/_base_/models/hivit/tiny_224.py new file mode 100644 index 0000000000000000000000000000000000000000..b3e2fdb3ce64aa8cfe42fb0b923d34fcdbb0524f --- /dev/null +++ b/configs/_base_/models/hivit/tiny_224.py @@ -0,0 +1,28 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='HiViT', + arch='tiny', + img_size=224, + ape=True, + rpe=True, + drop_path_rate=0.05), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/hornet/hornet-base-gf.py b/configs/_base_/models/hornet/hornet-base-gf.py new file mode 100644 index 0000000000000000000000000000000000000000..b6924f96265cda310a38765fa460ad685d9d01b7 --- /dev/null +++ b/configs/_base_/models/hornet/hornet-base-gf.py @@ -0,0 +1,20 @@ +model = dict( + type='ImageClassifier', + backbone=dict(type='HorNet', arch='base-gf', drop_path_rate=0.5), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. 
+ loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + dict(type='Constant', layer=['LayerScale'], val=1e-6) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/hornet/hornet-base.py b/configs/_base_/models/hornet/hornet-base.py new file mode 100644 index 0000000000000000000000000000000000000000..904379ab5f258fa366d75166e7446fccecf0bc2c --- /dev/null +++ b/configs/_base_/models/hornet/hornet-base.py @@ -0,0 +1,21 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HorNet', arch='base', drop_path_rate=0.5), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + dict(type='Constant', layer=['LayerScale'], val=1e-6) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/hornet/hornet-large-gf.py b/configs/_base_/models/hornet/hornet-large-gf.py new file mode 100644 index 0000000000000000000000000000000000000000..1607ba2208415699697f8ada17941cc75a6270a9 --- /dev/null +++ b/configs/_base_/models/hornet/hornet-large-gf.py @@ -0,0 +1,21 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HorNet', arch='large-gf', drop_path_rate=0.2), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + dict(type='Constant', layer=['LayerScale'], val=1e-6) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/hornet/hornet-large-gf384.py b/configs/_base_/models/hornet/hornet-large-gf384.py new file mode 100644 index 0000000000000000000000000000000000000000..fbb547873ed047adaed448fb1d443b4de8750ea4 --- /dev/null +++ b/configs/_base_/models/hornet/hornet-large-gf384.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HorNet', arch='large-gf384', drop_path_rate=0.4), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. 
+ loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + dict(type='Constant', layer=['LayerScale'], val=1e-6) + ]) diff --git a/configs/_base_/models/hornet/hornet-large.py b/configs/_base_/models/hornet/hornet-large.py new file mode 100644 index 0000000000000000000000000000000000000000..b5494fd8985970c2a60424ab6b6e07cd8965a6ed --- /dev/null +++ b/configs/_base_/models/hornet/hornet-large.py @@ -0,0 +1,21 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HorNet', arch='large', drop_path_rate=0.2), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + dict(type='Constant', layer=['LayerScale'], val=1e-6) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/hornet/hornet-small-gf.py b/configs/_base_/models/hornet/hornet-small-gf.py new file mode 100644 index 0000000000000000000000000000000000000000..42e26d3a4bf75aab77a3fbdda2135bed98223476 --- /dev/null +++ b/configs/_base_/models/hornet/hornet-small-gf.py @@ -0,0 +1,21 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HorNet', arch='small-gf', drop_path_rate=0.4), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + dict(type='Constant', layer=['LayerScale'], val=1e-6) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/hornet/hornet-small.py b/configs/_base_/models/hornet/hornet-small.py new file mode 100644 index 0000000000000000000000000000000000000000..d59184d40ab2f8a5c03c82caeade85dcd32c9180 --- /dev/null +++ b/configs/_base_/models/hornet/hornet-small.py @@ -0,0 +1,21 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HorNet', arch='small', drop_path_rate=0.4), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. 
+ loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + dict(type='Constant', layer=['LayerScale'], val=1e-6) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/hornet/hornet-tiny-gf.py b/configs/_base_/models/hornet/hornet-tiny-gf.py new file mode 100644 index 0000000000000000000000000000000000000000..6b06f5b121f18f26c5a3a3442f3bbf8842bdd206 --- /dev/null +++ b/configs/_base_/models/hornet/hornet-tiny-gf.py @@ -0,0 +1,21 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HorNet', arch='tiny-gf', drop_path_rate=0.2), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + dict(type='Constant', layer=['LayerScale'], val=1e-6) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/hornet/hornet-tiny.py b/configs/_base_/models/hornet/hornet-tiny.py new file mode 100644 index 0000000000000000000000000000000000000000..aed710eb862467da4d39c13a4fad41e7e6b76f29 --- /dev/null +++ b/configs/_base_/models/hornet/hornet-tiny.py @@ -0,0 +1,21 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HorNet', arch='tiny', drop_path_rate=0.2), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. 
+ loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + dict(type='Constant', layer=['LayerScale'], val=1e-6) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/hrnet/hrnet-w18.py b/configs/_base_/models/hrnet/hrnet-w18.py new file mode 100644 index 0000000000000000000000000000000000000000..f7fbf298d5b64ba1cefa46a4a5d2823c2fa8cf17 --- /dev/null +++ b/configs/_base_/models/hrnet/hrnet-w18.py @@ -0,0 +1,15 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HRNet', arch='w18'), + neck=[ + dict(type='HRFuseScales', in_channels=(18, 36, 72, 144)), + dict(type='GlobalAveragePooling'), + ], + head=dict( + type='LinearClsHead', + in_channels=2048, + num_classes=1000, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/hrnet/hrnet-w30.py b/configs/_base_/models/hrnet/hrnet-w30.py new file mode 100644 index 0000000000000000000000000000000000000000..babcacac59af0ff92802a71f48b249b29a760acb --- /dev/null +++ b/configs/_base_/models/hrnet/hrnet-w30.py @@ -0,0 +1,15 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HRNet', arch='w30'), + neck=[ + dict(type='HRFuseScales', in_channels=(30, 60, 120, 240)), + dict(type='GlobalAveragePooling'), + ], + head=dict( + type='LinearClsHead', + in_channels=2048, + num_classes=1000, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/hrnet/hrnet-w32.py b/configs/_base_/models/hrnet/hrnet-w32.py new file mode 100644 index 0000000000000000000000000000000000000000..2c1e980048d6bb855b94e0bb3027941d07513c05 --- /dev/null +++ b/configs/_base_/models/hrnet/hrnet-w32.py @@ -0,0 +1,15 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HRNet', arch='w32'), + neck=[ + dict(type='HRFuseScales', in_channels=(32, 64, 128, 256)), + dict(type='GlobalAveragePooling'), + ], + head=dict( + type='LinearClsHead', + in_channels=2048, + num_classes=1000, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/hrnet/hrnet-w40.py b/configs/_base_/models/hrnet/hrnet-w40.py new file mode 100644 index 0000000000000000000000000000000000000000..83f65d864679297b25b39438d49eb491c92c33a1 --- /dev/null +++ b/configs/_base_/models/hrnet/hrnet-w40.py @@ -0,0 +1,15 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HRNet', arch='w40'), + neck=[ + dict(type='HRFuseScales', in_channels=(40, 80, 160, 320)), + dict(type='GlobalAveragePooling'), + ], + head=dict( + type='LinearClsHead', + in_channels=2048, + num_classes=1000, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/hrnet/hrnet-w44.py b/configs/_base_/models/hrnet/hrnet-w44.py new file mode 100644 index 0000000000000000000000000000000000000000..e75dc0f891f6f9dd14ba31b865fd29afd622f4db --- /dev/null +++ b/configs/_base_/models/hrnet/hrnet-w44.py @@ -0,0 +1,15 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HRNet', arch='w44'), + neck=[ + dict(type='HRFuseScales', in_channels=(44, 88, 176, 352)), + dict(type='GlobalAveragePooling'), + ], + head=dict( + type='LinearClsHead', + in_channels=2048, 
+ num_classes=1000, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/hrnet/hrnet-w48.py b/configs/_base_/models/hrnet/hrnet-w48.py new file mode 100644 index 0000000000000000000000000000000000000000..f0604958481ba2af277e3a0f9515dc1423def6c6 --- /dev/null +++ b/configs/_base_/models/hrnet/hrnet-w48.py @@ -0,0 +1,15 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HRNet', arch='w48'), + neck=[ + dict(type='HRFuseScales', in_channels=(48, 96, 192, 384)), + dict(type='GlobalAveragePooling'), + ], + head=dict( + type='LinearClsHead', + in_channels=2048, + num_classes=1000, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/hrnet/hrnet-w64.py b/configs/_base_/models/hrnet/hrnet-w64.py new file mode 100644 index 0000000000000000000000000000000000000000..844c3fe9413f624dd374ceb1a9c3bbc185a20a3e --- /dev/null +++ b/configs/_base_/models/hrnet/hrnet-w64.py @@ -0,0 +1,15 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='HRNet', arch='w64'), + neck=[ + dict(type='HRFuseScales', in_channels=(64, 128, 256, 512)), + dict(type='GlobalAveragePooling'), + ], + head=dict( + type='LinearClsHead', + in_channels=2048, + num_classes=1000, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/inception_v3.py b/configs/_base_/models/inception_v3.py new file mode 100644 index 0000000000000000000000000000000000000000..3f6a8305efe2ef87cfd0d2676056a07595831c6b --- /dev/null +++ b/configs/_base_/models/inception_v3.py @@ -0,0 +1,10 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='InceptionV3', num_classes=1000, aux_logits=False), + neck=None, + head=dict( + type='ClsHead', + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5)), +) diff --git a/configs/_base_/models/itpn_hivit-base-p16.py b/configs/_base_/models/itpn_hivit-base-p16.py new file mode 100644 index 0000000000000000000000000000000000000000..834d6fe53b30b3370df0e5aaa08d6786472810a6 --- /dev/null +++ b/configs/_base_/models/itpn_hivit-base-p16.py @@ -0,0 +1,33 @@ +# model settings +model = dict( + type='iTPN', + backbone=dict( + type='iTPNHiViT', + arch='base', + reconstruction_type='pixel', + mask_ratio=0.75), + neck=dict( + type='iTPNPretrainDecoder', + num_patches=196, + patch_size=16, + in_chans=3, + embed_dim=512, + decoder_embed_dim=512, + decoder_depth=6, + decoder_num_heads=16, + mlp_ratio=4., + reconstruction_type='pixel', + # transformer pyramid + fpn_dim=256, + fpn_depth=2, + num_outs=3, + ), + head=dict( + type='MAEPretrainHead', + norm_pix=True, + patch_size=16, + loss=dict(type='PixelReconstructionLoss', criterion='L2')), + init_cfg=[ + dict(type='Xavier', layer='Linear', distribution='uniform'), + dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0) + ]) diff --git a/configs/_base_/models/levit-256-p16.py b/configs/_base_/models/levit-256-p16.py new file mode 100644 index 0000000000000000000000000000000000000000..936305bd254cb0c46f1bd0e8d0698f76b9a765c4 --- /dev/null +++ b/configs/_base_/models/levit-256-p16.py @@ -0,0 +1,26 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='LeViT', + arch='256', + img_size=224, + patch_size=16, + drop_path_rate=0, + attn_ratio=2, + mlp_ratio=2, + out_indices=(2, )), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LeViTClsHead', + num_classes=1000, + 
in_channels=512, + distillation=True, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, loss_weight=1.0), + topk=(1, 5), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ])) diff --git a/configs/_base_/models/mae_hivit-base-p16.py b/configs/_base_/models/mae_hivit-base-p16.py new file mode 100644 index 0000000000000000000000000000000000000000..bac073c840120c67e3c97b43bd5b308c62dbbbd9 --- /dev/null +++ b/configs/_base_/models/mae_hivit-base-p16.py @@ -0,0 +1,24 @@ +# model settings +model = dict( + type='MAE', + backbone=dict( + type='MAEHiViT', patch_size=16, arch='base', mask_ratio=0.75), + neck=dict( + type='MAEPretrainDecoder', + patch_size=16, + in_chans=3, + embed_dim=512, + decoder_embed_dim=512, + decoder_depth=6, + decoder_num_heads=16, + mlp_ratio=4., + ), + head=dict( + type='MAEPretrainHead', + norm_pix=True, + patch_size=16, + loss=dict(type='PixelReconstructionLoss', criterion='L2')), + init_cfg=[ + dict(type='Xavier', layer='Linear', distribution='uniform'), + dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0) + ]) diff --git a/configs/_base_/models/mae_vit-base-p16.py b/configs/_base_/models/mae_vit-base-p16.py new file mode 100644 index 0000000000000000000000000000000000000000..8cde8cb7c775d82941324f1abfa3432727b08a07 --- /dev/null +++ b/configs/_base_/models/mae_vit-base-p16.py @@ -0,0 +1,23 @@ +# model settings +model = dict( + type='MAE', + backbone=dict(type='MAEViT', arch='b', patch_size=16, mask_ratio=0.75), + neck=dict( + type='MAEPretrainDecoder', + patch_size=16, + in_chans=3, + embed_dim=768, + decoder_embed_dim=512, + decoder_depth=8, + decoder_num_heads=16, + mlp_ratio=4., + ), + head=dict( + type='MAEPretrainHead', + norm_pix=True, + patch_size=16, + loss=dict(type='PixelReconstructionLoss', criterion='L2')), + init_cfg=[ + dict(type='Xavier', layer='Linear', distribution='uniform'), + dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0) + ]) diff --git a/configs/_base_/models/mixmim/mixmim_base.py b/configs/_base_/models/mixmim/mixmim_base.py new file mode 100644 index 0000000000000000000000000000000000000000..ccde357570d22d3e1147b14ec480fd6b31f6a4cf --- /dev/null +++ b/configs/_base_/models/mixmim/mixmim_base.py @@ -0,0 +1,20 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='MixMIMTransformer', arch='B', drop_rate=0.0, drop_path_rate=0.1), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + init_cfg=None, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) 
+ ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/mlp_mixer_base_patch16.py b/configs/_base_/models/mlp_mixer_base_patch16.py new file mode 100644 index 0000000000000000000000000000000000000000..5ebd17f337bb3d6f14e0a45b40ef6f3342477090 --- /dev/null +++ b/configs/_base_/models/mlp_mixer_base_patch16.py @@ -0,0 +1,25 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='MlpMixer', + arch='b', + img_size=224, + patch_size=16, + drop_rate=0.1, + init_cfg=[ + dict( + type='Kaiming', + layer='Conv2d', + mode='fan_in', + nonlinearity='linear') + ]), + neck=dict(type='GlobalAveragePooling', dim=1), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + ), +) diff --git a/configs/_base_/models/mlp_mixer_large_patch16.py b/configs/_base_/models/mlp_mixer_large_patch16.py new file mode 100644 index 0000000000000000000000000000000000000000..ff107139bc9aa202b5b60696761f4167c25b5be3 --- /dev/null +++ b/configs/_base_/models/mlp_mixer_large_patch16.py @@ -0,0 +1,25 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='MlpMixer', + arch='l', + img_size=224, + patch_size=16, + drop_rate=0.1, + init_cfg=[ + dict( + type='Kaiming', + layer='Conv2d', + mode='fan_in', + nonlinearity='linear') + ]), + neck=dict(type='GlobalAveragePooling', dim=1), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + ), +) diff --git a/configs/_base_/models/mobilenet_v2_1x.py b/configs/_base_/models/mobilenet_v2_1x.py new file mode 100644 index 0000000000000000000000000000000000000000..6ebff1eff937a1390f23567c37debd164aeb8c9e --- /dev/null +++ b/configs/_base_/models/mobilenet_v2_1x.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='MobileNetV2', widen_factor=1.0), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_large_imagenet.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_large_imagenet.py new file mode 100644 index 0000000000000000000000000000000000000000..5318f50feeb7d0d3f54bd70e6f854d1a74fb0743 --- /dev/null +++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_large_imagenet.py @@ -0,0 +1,16 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='MobileNetV3', arch='large'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='StackedLinearClsHead', + num_classes=1000, + in_channels=960, + mid_channels=[1280], + dropout_rate=0.2, + act_cfg=dict(type='HSwish'), + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + init_cfg=dict( + type='Normal', layer='Linear', mean=0., std=0.01, bias=0.), + topk=(1, 5))) diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_050_imagenet.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_050_imagenet.py new file mode 100644 index 0000000000000000000000000000000000000000..6356efcd1bf4beacb200f9bb4a3780963c68a302 --- /dev/null +++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_050_imagenet.py @@ -0,0 +1,16 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='MobileNetV3', arch='small_050'), + 
neck=dict(type='GlobalAveragePooling'), + head=dict( + type='StackedLinearClsHead', + num_classes=1000, + in_channels=288, + mid_channels=[1024], + dropout_rate=0.2, + act_cfg=dict(type='HSwish'), + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + init_cfg=dict( + type='Normal', layer='Linear', mean=0., std=0.01, bias=0.), + topk=(1, 5))) diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_075_imagenet.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_075_imagenet.py new file mode 100644 index 0000000000000000000000000000000000000000..19391ec26a2b1d86d0707a780e60033db166149c --- /dev/null +++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_075_imagenet.py @@ -0,0 +1,16 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='MobileNetV3', arch='small_075'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='StackedLinearClsHead', + num_classes=1000, + in_channels=432, + mid_channels=[1024], + dropout_rate=0.2, + act_cfg=dict(type='HSwish'), + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + init_cfg=dict( + type='Normal', layer='Linear', mean=0., std=0.01, bias=0.), + topk=(1, 5))) diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_cifar.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_cifar.py new file mode 100644 index 0000000000000000000000000000000000000000..5dbe980c47c83733b94a7cfe5b5ae44b3dd15729 --- /dev/null +++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_cifar.py @@ -0,0 +1,13 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='MobileNetV3', arch='small'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='StackedLinearClsHead', + num_classes=10, + in_channels=576, + mid_channels=[1280], + act_cfg=dict(type='HSwish'), + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5))) diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_imagenet.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_imagenet.py new file mode 100644 index 0000000000000000000000000000000000000000..af6cc1b8d9dcb5b0ec21b38317950149a8a61a10 --- /dev/null +++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_imagenet.py @@ -0,0 +1,16 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='MobileNetV3', arch='small'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='StackedLinearClsHead', + num_classes=1000, + in_channels=576, + mid_channels=[1024], + dropout_rate=0.2, + act_cfg=dict(type='HSwish'), + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + init_cfg=dict( + type='Normal', layer='Linear', mean=0., std=0.01, bias=0.), + topk=(1, 5))) diff --git a/configs/_base_/models/mobileone/mobileone_s0.py b/configs/_base_/models/mobileone/mobileone_s0.py new file mode 100644 index 0000000000000000000000000000000000000000..39624e5594e5270376a3e08719831f5e84ff234a --- /dev/null +++ b/configs/_base_/models/mobileone/mobileone_s0.py @@ -0,0 +1,19 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='MobileOne', + arch='s0', + out_indices=(3, ), + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + mode='original', + ), + topk=(1, 5), + )) diff --git a/configs/_base_/models/mobileone/mobileone_s1.py b/configs/_base_/models/mobileone/mobileone_s1.py new file mode 100644 index 
0000000000000000000000000000000000000000..cea7762e4b93d6fde21901dbcdb9593209439a5f --- /dev/null +++ b/configs/_base_/models/mobileone/mobileone_s1.py @@ -0,0 +1,19 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='MobileOne', + arch='s1', + out_indices=(3, ), + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + mode='original', + ), + topk=(1, 5), + )) diff --git a/configs/_base_/models/mobileone/mobileone_s2.py b/configs/_base_/models/mobileone/mobileone_s2.py new file mode 100644 index 0000000000000000000000000000000000000000..dfae0e1f1a896830d0fde43fdada9f84c3fd3e30 --- /dev/null +++ b/configs/_base_/models/mobileone/mobileone_s2.py @@ -0,0 +1,19 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='MobileOne', + arch='s2', + out_indices=(3, ), + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + mode='original', + ), + topk=(1, 5), + )) diff --git a/configs/_base_/models/mobileone/mobileone_s3.py b/configs/_base_/models/mobileone/mobileone_s3.py new file mode 100644 index 0000000000000000000000000000000000000000..813567530413cc4b73a3aef08a8b58dc9fca47e1 --- /dev/null +++ b/configs/_base_/models/mobileone/mobileone_s3.py @@ -0,0 +1,19 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='MobileOne', + arch='s3', + out_indices=(3, ), + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + mode='original', + ), + topk=(1, 5), + )) diff --git a/configs/_base_/models/mobileone/mobileone_s4.py b/configs/_base_/models/mobileone/mobileone_s4.py new file mode 100644 index 0000000000000000000000000000000000000000..282eec8bcf1ce3adf2bfc3861734f1a5b65ea7bf --- /dev/null +++ b/configs/_base_/models/mobileone/mobileone_s4.py @@ -0,0 +1,19 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='MobileOne', + arch='s4', + out_indices=(3, ), + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + mode='original', + ), + topk=(1, 5), + )) diff --git a/configs/_base_/models/mobilevit/mobilevit_s.py b/configs/_base_/models/mobilevit/mobilevit_s.py new file mode 100644 index 0000000000000000000000000000000000000000..f6a4e05d2c8f1fc4f7b6a6b5953ff52cdfc7a2c6 --- /dev/null +++ b/configs/_base_/models/mobilevit/mobilevit_s.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='MobileViT', arch='small'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=640, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/mobilevit/mobilevit_xs.py b/configs/_base_/models/mobilevit/mobilevit_xs.py new file mode 100644 index 0000000000000000000000000000000000000000..f8c6ef08eb0876bd70508fe72fd81e45470ffbf8 --- /dev/null +++ b/configs/_base_/models/mobilevit/mobilevit_xs.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='MobileViT', arch='x_small'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + 
type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/mobilevit/mobilevit_xxs.py b/configs/_base_/models/mobilevit/mobilevit_xxs.py new file mode 100644 index 0000000000000000000000000000000000000000..e1c26e6f3e9f559b2599589b7de690ef45ea5611 --- /dev/null +++ b/configs/_base_/models/mobilevit/mobilevit_xxs.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='MobileViT', arch='xx_small'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=320, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/mvit/mvitv2-base.py b/configs/_base_/models/mvit/mvitv2-base.py new file mode 100644 index 0000000000000000000000000000000000000000..0cb6064f627bb9ec8e80295623be6c734d1c03c9 --- /dev/null +++ b/configs/_base_/models/mvit/mvitv2-base.py @@ -0,0 +1,19 @@ +model = dict( + type='ImageClassifier', + backbone=dict(type='MViT', arch='base', drop_path_rate=0.3), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + in_channels=768, + num_classes=1000, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/mvit/mvitv2-large.py b/configs/_base_/models/mvit/mvitv2-large.py new file mode 100644 index 0000000000000000000000000000000000000000..2c84424311334030010f4b0651876ee8c3bc57cc --- /dev/null +++ b/configs/_base_/models/mvit/mvitv2-large.py @@ -0,0 +1,23 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='MViT', + arch='large', + drop_path_rate=0.5, + dim_mul_in_attention=False), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + in_channels=1152, + num_classes=1000, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/mvit/mvitv2-small.py b/configs/_base_/models/mvit/mvitv2-small.py new file mode 100644 index 0000000000000000000000000000000000000000..df895f2950cbf7aa009c308a86352147e427e309 --- /dev/null +++ b/configs/_base_/models/mvit/mvitv2-small.py @@ -0,0 +1,19 @@ +model = dict( + type='ImageClassifier', + backbone=dict(type='MViT', arch='small', drop_path_rate=0.1), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + in_channels=768, + num_classes=1000, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) 
+ ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/mvit/mvitv2-tiny.py b/configs/_base_/models/mvit/mvitv2-tiny.py new file mode 100644 index 0000000000000000000000000000000000000000..836f04bfce975487ccb05d38f47150e128313918 --- /dev/null +++ b/configs/_base_/models/mvit/mvitv2-tiny.py @@ -0,0 +1,19 @@ +model = dict( + type='ImageClassifier', + backbone=dict(type='MViT', arch='tiny', drop_path_rate=0.1), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + in_channels=768, + num_classes=1000, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/_base_/models/poolformer/poolformer_m36.py b/configs/_base_/models/poolformer/poolformer_m36.py new file mode 100644 index 0000000000000000000000000000000000000000..276a72122b18f0731aded4c7652897d92814d53d --- /dev/null +++ b/configs/_base_/models/poolformer/poolformer_m36.py @@ -0,0 +1,22 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='PoolFormer', + arch='m36', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/poolformer/poolformer_m48.py b/configs/_base_/models/poolformer/poolformer_m48.py new file mode 100644 index 0000000000000000000000000000000000000000..8c006acbc0d01caa8ecc66b26a3d7b0e75725dab --- /dev/null +++ b/configs/_base_/models/poolformer/poolformer_m48.py @@ -0,0 +1,22 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='PoolFormer', + arch='m48', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/poolformer/poolformer_s12.py b/configs/_base_/models/poolformer/poolformer_s12.py new file mode 100644 index 0000000000000000000000000000000000000000..b7b3600f35813acc633845050b1280873ac7ee47 --- /dev/null +++ b/configs/_base_/models/poolformer/poolformer_s12.py @@ -0,0 +1,22 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='PoolFormer', + arch='s12', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/poolformer/poolformer_s24.py b/configs/_base_/models/poolformer/poolformer_s24.py new file mode 100644 index 0000000000000000000000000000000000000000..822ab5b309c043569cfff4f124680906e9593a5b --- /dev/null 
+++ b/configs/_base_/models/poolformer/poolformer_s24.py @@ -0,0 +1,22 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='PoolFormer', + arch='s24', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/poolformer/poolformer_s36.py b/configs/_base_/models/poolformer/poolformer_s36.py new file mode 100644 index 0000000000000000000000000000000000000000..489f2223c0dbfe25d02dc804843ff8ce379639d2 --- /dev/null +++ b/configs/_base_/models/poolformer/poolformer_s36.py @@ -0,0 +1,22 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='PoolFormer', + arch='s36', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/regnet/regnetx_1.6gf.py b/configs/_base_/models/regnet/regnetx_1.6gf.py new file mode 100644 index 0000000000000000000000000000000000000000..b81f0ad25bc5c6ccf1775e580f59b86a851fb950 --- /dev/null +++ b/configs/_base_/models/regnet/regnetx_1.6gf.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='RegNet', arch='regnetx_1.6gf'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=912, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/regnet/regnetx_12gf.py b/configs/_base_/models/regnet/regnetx_12gf.py new file mode 100644 index 0000000000000000000000000000000000000000..383d4f87992d3d7cb6b9de35e2a82e371a46b12c --- /dev/null +++ b/configs/_base_/models/regnet/regnetx_12gf.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='RegNet', arch='regnetx_12gf'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2240, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/regnet/regnetx_3.2gf.py b/configs/_base_/models/regnet/regnetx_3.2gf.py new file mode 100644 index 0000000000000000000000000000000000000000..67d454139586d60c17f5468807f761f7835fd0f7 --- /dev/null +++ b/configs/_base_/models/regnet/regnetx_3.2gf.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='RegNet', arch='regnetx_3.2gf'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1008, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/regnet/regnetx_4.0gf.py b/configs/_base_/models/regnet/regnetx_4.0gf.py new file mode 100644 index 0000000000000000000000000000000000000000..01419c64bd18a5a1f9a0c9606209726b957f24ea --- /dev/null +++ b/configs/_base_/models/regnet/regnetx_4.0gf.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='RegNet', 
arch='regnetx_4.0gf'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1360, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/regnet/regnetx_400mf.py b/configs/_base_/models/regnet/regnetx_400mf.py new file mode 100644 index 0000000000000000000000000000000000000000..ef518b9f7df4484c158d24e9522a61e41cca3f15 --- /dev/null +++ b/configs/_base_/models/regnet/regnetx_400mf.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='RegNet', arch='regnetx_400mf'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/regnet/regnetx_6.4gf.py b/configs/_base_/models/regnet/regnetx_6.4gf.py new file mode 100644 index 0000000000000000000000000000000000000000..44e6222af015cd5a93e5feccdb98348f1da3991a --- /dev/null +++ b/configs/_base_/models/regnet/regnetx_6.4gf.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='RegNet', arch='regnetx_6.4gf'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1624, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/regnet/regnetx_8.0gf.py b/configs/_base_/models/regnet/regnetx_8.0gf.py new file mode 100644 index 0000000000000000000000000000000000000000..29298268d767b45d3d5dcde4dd72663b1c407525 --- /dev/null +++ b/configs/_base_/models/regnet/regnetx_8.0gf.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='RegNet', arch='regnetx_8.0gf'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1920, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/regnet/regnetx_800mf.py b/configs/_base_/models/regnet/regnetx_800mf.py new file mode 100644 index 0000000000000000000000000000000000000000..210f760fe29c104c662123af4cecef143ddc9ec3 --- /dev/null +++ b/configs/_base_/models/regnet/regnetx_800mf.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='RegNet', arch='regnetx_800mf'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=672, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/replknet-31B_in1k.py b/configs/_base_/models/replknet-31B_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0cc50959d4bfc4597269de078ecabe5c663963b2 --- /dev/null +++ b/configs/_base_/models/replknet-31B_in1k.py @@ -0,0 +1,15 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='RepLKNet', + arch='31B', + out_indices=(3, ), + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git
a/configs/_base_/models/replknet-31L_in1k.py b/configs/_base_/models/replknet-31L_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..7830fb06f74a1ba2d7d437cc7733f446ecb12872 --- /dev/null +++ b/configs/_base_/models/replknet-31L_in1k.py @@ -0,0 +1,15 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='RepLKNet', + arch='31L', + out_indices=(3, ), + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/replknet-XL_in1k.py b/configs/_base_/models/replknet-XL_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b63f3459c9914a247e8373e1fba4cbd8b4a5a81a --- /dev/null +++ b/configs/_base_/models/replknet-XL_in1k.py @@ -0,0 +1,15 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='RepLKNet', + arch='XL', + out_indices=(3, ), + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/repmlp-base_224.py b/configs/_base_/models/repmlp-base_224.py new file mode 100644 index 0000000000000000000000000000000000000000..7db0077882168d1466fede11243f70837df29395 --- /dev/null +++ b/configs/_base_/models/repmlp-base_224.py @@ -0,0 +1,18 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RepMLPNet', + arch='B', + img_size=224, + out_indices=(3, ), + reparam_conv_kernels=(1, 3), + deploy=False), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/repvgg-A0_in1k.py b/configs/_base_/models/repvgg-A0_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..093ffb7eea9f6af6a17e6fe766ba1f1a6160b28d --- /dev/null +++ b/configs/_base_/models/repvgg-A0_in1k.py @@ -0,0 +1,15 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='RepVGG', + arch='A0', + out_indices=(3, ), + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/repvgg-B3_lbs-mixup_in1k.py b/configs/_base_/models/repvgg-B3_lbs-mixup_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d88e687b35df35cd5993d24d929a686bf0af6f8b --- /dev/null +++ b/configs/_base_/models/repvgg-B3_lbs-mixup_in1k.py @@ -0,0 +1,22 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='RepVGG', + arch='B3', + out_indices=(3, ), + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2560, + loss=dict( + type='LabelSmoothLoss', + loss_weight=1.0, + label_smooth_val=0.1, + mode='classy_vision', + num_classes=1000), + topk=(1, 5), + ), + train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)), +) diff --git a/configs/_base_/models/res2net101-w26-s4.py b/configs/_base_/models/res2net101-w26-s4.py new file mode 100644 index 0000000000000000000000000000000000000000..3bf64c508f95f8f3d2eb14afbe85799a49ee69aa --- /dev/null +++ b/configs/_base_/models/res2net101-w26-s4.py @@ -0,0 +1,18 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + 
type='Res2Net', + depth=101, + scales=4, + base_width=26, + deep_stem=False, + avg_down=False, + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/res2net50-w14-s8.py b/configs/_base_/models/res2net50-w14-s8.py new file mode 100644 index 0000000000000000000000000000000000000000..5875142c34d64f8414929bd43ccf37971bc97df8 --- /dev/null +++ b/configs/_base_/models/res2net50-w14-s8.py @@ -0,0 +1,18 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='Res2Net', + depth=50, + scales=8, + base_width=14, + deep_stem=False, + avg_down=False, + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/res2net50-w26-s4.py b/configs/_base_/models/res2net50-w26-s4.py new file mode 100644 index 0000000000000000000000000000000000000000..be8fdb585903564a9572b575b48967dd1a12c3f4 --- /dev/null +++ b/configs/_base_/models/res2net50-w26-s4.py @@ -0,0 +1,18 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='Res2Net', + depth=50, + scales=4, + base_width=26, + deep_stem=False, + avg_down=False, + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/res2net50-w26-s6.py b/configs/_base_/models/res2net50-w26-s6.py new file mode 100644 index 0000000000000000000000000000000000000000..281b136a67e245ee90e94bd1495b449af39118e3 --- /dev/null +++ b/configs/_base_/models/res2net50-w26-s6.py @@ -0,0 +1,18 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='Res2Net', + depth=50, + scales=6, + base_width=26, + deep_stem=False, + avg_down=False, + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/res2net50-w26-s8.py b/configs/_base_/models/res2net50-w26-s8.py new file mode 100644 index 0000000000000000000000000000000000000000..b4f62f3ed19e4ba1f833a23cb5c8d434456b5b07 --- /dev/null +++ b/configs/_base_/models/res2net50-w26-s8.py @@ -0,0 +1,18 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='Res2Net', + depth=50, + scales=8, + base_width=26, + deep_stem=False, + avg_down=False, + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/res2net50-w48-s2.py b/configs/_base_/models/res2net50-w48-s2.py new file mode 100644 index 0000000000000000000000000000000000000000..8675c91fa008f72ddcaa10f11b91e1f6feb79953 --- /dev/null +++ b/configs/_base_/models/res2net50-w48-s2.py @@ -0,0 +1,18 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='Res2Net', + depth=50, + scales=2, + base_width=48, + deep_stem=False, + avg_down=False, + ), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnest101.py 
b/configs/_base_/models/resnest101.py new file mode 100644 index 0000000000000000000000000000000000000000..3780c1549359ec1850ce1db546d23a667e699d4f --- /dev/null +++ b/configs/_base_/models/resnest101.py @@ -0,0 +1,25 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNeSt', + depth=101, + num_stages=4, + stem_channels=128, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + num_classes=1000, + reduction='mean', + loss_weight=1.0), + topk=(1, 5), + cal_acc=False), + train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)), +) diff --git a/configs/_base_/models/resnest200.py b/configs/_base_/models/resnest200.py new file mode 100644 index 0000000000000000000000000000000000000000..40d8f03e7f528f8c0132bd2c19515460fd47fe70 --- /dev/null +++ b/configs/_base_/models/resnest200.py @@ -0,0 +1,25 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNeSt', + depth=200, + num_stages=4, + stem_channels=128, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + num_classes=1000, + reduction='mean', + loss_weight=1.0), + topk=(1, 5), + cal_acc=False), + train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)), +) diff --git a/configs/_base_/models/resnest269.py b/configs/_base_/models/resnest269.py new file mode 100644 index 0000000000000000000000000000000000000000..c37626f5678630383693d784d2590f27caa11de2 --- /dev/null +++ b/configs/_base_/models/resnest269.py @@ -0,0 +1,25 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNeSt', + depth=269, + num_stages=4, + stem_channels=128, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + num_classes=1000, + reduction='mean', + loss_weight=1.0), + topk=(1, 5), + cal_acc=False), + train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)), +) diff --git a/configs/_base_/models/resnest50.py b/configs/_base_/models/resnest50.py new file mode 100644 index 0000000000000000000000000000000000000000..51c90e86f468edccc3de3b0e7cd783548d220db4 --- /dev/null +++ b/configs/_base_/models/resnest50.py @@ -0,0 +1,24 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNeSt', + depth=50, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + num_classes=1000, + reduction='mean', + loss_weight=1.0), + topk=(1, 5), + cal_acc=False), + train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)), +) diff --git a/configs/_base_/models/resnet101.py b/configs/_base_/models/resnet101.py new file mode 100644 index 0000000000000000000000000000000000000000..1147cd4be9aff00ad6ce66c31e2839c1a94f9ca3 --- /dev/null +++ b/configs/_base_/models/resnet101.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet', + depth=101, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + 
neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnet101_cifar.py b/configs/_base_/models/resnet101_cifar.py new file mode 100644 index 0000000000000000000000000000000000000000..a84d470e3a9828532e5cddcb1a3f7aa4fcae9f68 --- /dev/null +++ b/configs/_base_/models/resnet101_cifar.py @@ -0,0 +1,16 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet_CIFAR', + depth=101, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=10, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/resnet152.py b/configs/_base_/models/resnet152.py new file mode 100644 index 0000000000000000000000000000000000000000..94a718c3cec213727a7a2f11baeb3594fd37532e --- /dev/null +++ b/configs/_base_/models/resnet152.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet', + depth=152, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnet152_cifar.py b/configs/_base_/models/resnet152_cifar.py new file mode 100644 index 0000000000000000000000000000000000000000..55c0cc6c66dbde26bebe6d99d791c3e3f28e4e27 --- /dev/null +++ b/configs/_base_/models/resnet152_cifar.py @@ -0,0 +1,16 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet_CIFAR', + depth=152, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=10, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/resnet18.py b/configs/_base_/models/resnet18.py new file mode 100644 index 0000000000000000000000000000000000000000..7c66758ee4aadced38c815e98af68b74aa310a2e --- /dev/null +++ b/configs/_base_/models/resnet18.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet', + depth=18, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnet18_cifar.py b/configs/_base_/models/resnet18_cifar.py new file mode 100644 index 0000000000000000000000000000000000000000..7b9cf1e7337de73aa21515547b6c3d16e2b178ea --- /dev/null +++ b/configs/_base_/models/resnet18_cifar.py @@ -0,0 +1,16 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet_CIFAR', + depth=18, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=10, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/resnet34.py b/configs/_base_/models/resnet34.py new file mode 100644 index 0000000000000000000000000000000000000000..100ee286bead6b5dd88f1752660e8ab9d0498e37 --- /dev/null +++ 
b/configs/_base_/models/resnet34.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet', + depth=34, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnet34_cifar.py b/configs/_base_/models/resnet34_cifar.py new file mode 100644 index 0000000000000000000000000000000000000000..55d033bc30bcbde7aef8e57ad950f59c248ad74b --- /dev/null +++ b/configs/_base_/models/resnet34_cifar.py @@ -0,0 +1,16 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet_CIFAR', + depth=34, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=10, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/resnet34_gem.py b/configs/_base_/models/resnet34_gem.py new file mode 100644 index 0000000000000000000000000000000000000000..5c0e0d3e8dc5d7a0b259f1624ee2402af8a401cd --- /dev/null +++ b/configs/_base_/models/resnet34_gem.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet', + depth=34, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GeneralizedMeanPooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnet50.py b/configs/_base_/models/resnet50.py new file mode 100644 index 0000000000000000000000000000000000000000..129a2bb50c91f3034997d216f3a9efb743d9cc40 --- /dev/null +++ b/configs/_base_/models/resnet50.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet', + depth=50, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnet50_cifar.py b/configs/_base_/models/resnet50_cifar.py new file mode 100644 index 0000000000000000000000000000000000000000..33b66d526482245237faa2862d376797c21a8ee4 --- /dev/null +++ b/configs/_base_/models/resnet50_cifar.py @@ -0,0 +1,16 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet_CIFAR', + depth=50, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=10, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/resnet50_cifar_cutmix.py b/configs/_base_/models/resnet50_cifar_cutmix.py new file mode 100644 index 0000000000000000000000000000000000000000..73c38be271a90b1655ae63e4f36cf6c3a3c5fdc4 --- /dev/null +++ b/configs/_base_/models/resnet50_cifar_cutmix.py @@ -0,0 +1,18 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet_CIFAR', + depth=50, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='MultiLabelLinearClsHead', + num_classes=10, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', 
loss_weight=1.0, use_soft=True)), + train_cfg=dict( + augments=dict( + type='CutMix', alpha=1.0))) diff --git a/configs/_base_/models/resnet50_cifar_mixup.py b/configs/_base_/models/resnet50_cifar_mixup.py new file mode 100644 index 0000000000000000000000000000000000000000..f165c2466bd8a67cbfadd5f3a388d4fe03e6d446 --- /dev/null +++ b/configs/_base_/models/resnet50_cifar_mixup.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet_CIFAR', + depth=50, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='MultiLabelLinearClsHead', + num_classes=10, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0, use_soft=True)), + train_cfg=dict(augments=dict(type='Mixup', alpha=1.)), +) diff --git a/configs/_base_/models/resnet50_cutmix.py b/configs/_base_/models/resnet50_cutmix.py new file mode 100644 index 0000000000000000000000000000000000000000..fb79088b798d1c16eb6c336006143c2fe288e6a2 --- /dev/null +++ b/configs/_base_/models/resnet50_cutmix.py @@ -0,0 +1,18 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet', + depth=50, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='MultiLabelLinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0, use_soft=True)), + train_cfg=dict( + augments=dict( + type='CutMix', alpha=1.0))) diff --git a/configs/_base_/models/resnet50_label_smooth.py b/configs/_base_/models/resnet50_label_smooth.py new file mode 100644 index 0000000000000000000000000000000000000000..b6f793751904658b3e7e01a5ffdaa6b86e156e66 --- /dev/null +++ b/configs/_base_/models/resnet50_label_smooth.py @@ -0,0 +1,18 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet', + depth=50, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnet50_mixup.py b/configs/_base_/models/resnet50_mixup.py new file mode 100644 index 0000000000000000000000000000000000000000..23130a69c98823a6979dcd7ee7441746753a9865 --- /dev/null +++ b/configs/_base_/models/resnet50_mixup.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet', + depth=50, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='MultiLabelLinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0, use_soft=True)), + train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)), +) diff --git a/configs/_base_/models/resnetv1c50.py b/configs/_base_/models/resnetv1c50.py new file mode 100644 index 0000000000000000000000000000000000000000..3b973e20181cd3cf1c470db84abf97aeaa0549c1 --- /dev/null +++ b/configs/_base_/models/resnetv1c50.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNetV1c', + depth=50, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, +
loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnetv1d101.py b/configs/_base_/models/resnetv1d101.py new file mode 100644 index 0000000000000000000000000000000000000000..1e56223121fb22ac089800ebeb69310758d0f2e7 --- /dev/null +++ b/configs/_base_/models/resnetv1d101.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNetV1d', + depth=101, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnetv1d152.py b/configs/_base_/models/resnetv1d152.py new file mode 100644 index 0000000000000000000000000000000000000000..58cc73beb318e38f9ce79154a1265be1a7dba17b --- /dev/null +++ b/configs/_base_/models/resnetv1d152.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNetV1d', + depth=152, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnetv1d50.py b/configs/_base_/models/resnetv1d50.py new file mode 100644 index 0000000000000000000000000000000000000000..015aaa3d8182cae50f392d7103e24e8ac8a188aa --- /dev/null +++ b/configs/_base_/models/resnetv1d50.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNetV1d', + depth=50, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnext101_32x4d.py b/configs/_base_/models/resnext101_32x4d.py new file mode 100644 index 0000000000000000000000000000000000000000..1c89fb6488701c83f12e623ae606abbe3b78799f --- /dev/null +++ b/configs/_base_/models/resnext101_32x4d.py @@ -0,0 +1,19 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNeXt', + depth=101, + num_stages=4, + out_indices=(3, ), + groups=32, + width_per_group=4, + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnext101_32x8d.py b/configs/_base_/models/resnext101_32x8d.py new file mode 100644 index 0000000000000000000000000000000000000000..2bb63f3aeb8b37eb701135ed1c6bf2d15869fae3 --- /dev/null +++ b/configs/_base_/models/resnext101_32x8d.py @@ -0,0 +1,19 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNeXt', + depth=101, + num_stages=4, + out_indices=(3, ), + groups=32, + width_per_group=8, + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnext152_32x4d.py b/configs/_base_/models/resnext152_32x4d.py new file mode 100644 index 0000000000000000000000000000000000000000..d392eff3dc673b0b74ed013c030152a0107799a2 --- 
/dev/null +++ b/configs/_base_/models/resnext152_32x4d.py @@ -0,0 +1,19 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNeXt', + depth=152, + num_stages=4, + out_indices=(3, ), + groups=32, + width_per_group=4, + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/resnext50_32x4d.py b/configs/_base_/models/resnext50_32x4d.py new file mode 100644 index 0000000000000000000000000000000000000000..060426231e8cd845fda17ea053478cf7f57b940a --- /dev/null +++ b/configs/_base_/models/resnext50_32x4d.py @@ -0,0 +1,19 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNeXt', + depth=50, + num_stages=4, + out_indices=(3, ), + groups=32, + width_per_group=4, + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/revvit/revvit-base.py b/configs/_base_/models/revvit/revvit-base.py new file mode 100644 index 0000000000000000000000000000000000000000..85b7af42ea7fd6856fd81bc99ee829fb40bce435 --- /dev/null +++ b/configs/_base_/models/revvit/revvit-base.py @@ -0,0 +1,27 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RevVisionTransformer', + arch='deit-base', + img_size=224, + patch_size=16, + out_type='avg_featmap', + ), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/revvit/revvit-small.py b/configs/_base_/models/revvit/revvit-small.py new file mode 100644 index 0000000000000000000000000000000000000000..dd1a0b2661ac2cf54554c06bd729477b94dad908 --- /dev/null +++ b/configs/_base_/models/revvit/revvit-small.py @@ -0,0 +1,27 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RevVisionTransformer', + arch='deit-small', + img_size=224, + patch_size=16, + out_type='avg_featmap', + ), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/seresnet101.py b/configs/_base_/models/seresnet101.py new file mode 100644 index 0000000000000000000000000000000000000000..137a6f90f6bca160a073877fc43ea6398fa1d0b4 --- /dev/null +++ b/configs/_base_/models/seresnet101.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='SEResNet', + depth=101, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + 
topk=(1, 5), + )) diff --git a/configs/_base_/models/seresnet50.py b/configs/_base_/models/seresnet50.py new file mode 100644 index 0000000000000000000000000000000000000000..e5f6bfce8db9ed75936229bf57992a0211a95b7d --- /dev/null +++ b/configs/_base_/models/seresnet50.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='SEResNet', + depth=50, + num_stages=4, + out_indices=(3, ), + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/seresnext101_32x4d.py b/configs/_base_/models/seresnext101_32x4d.py new file mode 100644 index 0000000000000000000000000000000000000000..cc8a62c39305993bf9b717edf980a1546de12a2b --- /dev/null +++ b/configs/_base_/models/seresnext101_32x4d.py @@ -0,0 +1,20 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='SEResNeXt', + depth=101, + num_stages=4, + out_indices=(3, ), + groups=32, + width_per_group=4, + se_ratio=16, + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/seresnext50_32x4d.py b/configs/_base_/models/seresnext50_32x4d.py new file mode 100644 index 0000000000000000000000000000000000000000..0cdf7cb696be22d3a5fa5829162052c8b9b7e7a8 --- /dev/null +++ b/configs/_base_/models/seresnext50_32x4d.py @@ -0,0 +1,20 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='SEResNeXt', + depth=50, + num_stages=4, + out_indices=(3, ), + groups=32, + width_per_group=4, + se_ratio=16, + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/shufflenet_v1_1x.py b/configs/_base_/models/shufflenet_v1_1x.py new file mode 100644 index 0000000000000000000000000000000000000000..f0f9d1fbdde759e6c13d9a02705072b3f11faf02 --- /dev/null +++ b/configs/_base_/models/shufflenet_v1_1x.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='ShuffleNetV1', groups=3), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=960, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/shufflenet_v2_1x.py b/configs/_base_/models/shufflenet_v2_1x.py new file mode 100644 index 0000000000000000000000000000000000000000..190800e343d75a89ffb67a1f7dd33db04d26429d --- /dev/null +++ b/configs/_base_/models/shufflenet_v2_1x.py @@ -0,0 +1,12 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='ShuffleNetV2', widen_factor=1.0), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/swin_transformer/base_224.py b/configs/_base_/models/swin_transformer/base_224.py new file mode 100644 index 0000000000000000000000000000000000000000..b7c277f2d6494a6d069bcf053349d8c5df2a0bc3 --- /dev/null +++ b/configs/_base_/models/swin_transformer/base_224.py @@ -0,0 +1,23 @@ +# 
model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformer', arch='base', img_size=224, drop_path_rate=0.5), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/swin_transformer/base_384.py b/configs/_base_/models/swin_transformer/base_384.py new file mode 100644 index 0000000000000000000000000000000000000000..ce78981fb0775bdb4048522f32e25c58e2159160 --- /dev/null +++ b/configs/_base_/models/swin_transformer/base_384.py @@ -0,0 +1,16 @@ +# model settings +# Only for evaluation +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformer', + arch='base', + img_size=384, + stage_cfgs=dict(block_cfgs=dict(window_size=12))), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5))) diff --git a/configs/_base_/models/swin_transformer/large_224.py b/configs/_base_/models/swin_transformer/large_224.py new file mode 100644 index 0000000000000000000000000000000000000000..747d00e44d4b81383998d7f18b7ae8668bf41c5f --- /dev/null +++ b/configs/_base_/models/swin_transformer/large_224.py @@ -0,0 +1,12 @@ +# model settings +# Only for evaluation +model = dict( + type='ImageClassifier', + backbone=dict(type='SwinTransformer', arch='large', img_size=224), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5))) diff --git a/configs/_base_/models/swin_transformer/large_384.py b/configs/_base_/models/swin_transformer/large_384.py new file mode 100644 index 0000000000000000000000000000000000000000..7026f81a31de2adc445b8ce45520904205f72cee --- /dev/null +++ b/configs/_base_/models/swin_transformer/large_384.py @@ -0,0 +1,16 @@ +# model settings +# Only for evaluation +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformer', + arch='large', + img_size=384, + stage_cfgs=dict(block_cfgs=dict(window_size=12))), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5))) diff --git a/configs/_base_/models/swin_transformer/small_224.py b/configs/_base_/models/swin_transformer/small_224.py new file mode 100644 index 0000000000000000000000000000000000000000..d87d9d9af6ce9c80581dc03925ed13b4b36893fc --- /dev/null +++ b/configs/_base_/models/swin_transformer/small_224.py @@ -0,0 +1,24 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformer', arch='small', img_size=224, + drop_path_rate=0.3), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. 
+ loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/swin_transformer/tiny_224.py b/configs/_base_/models/swin_transformer/tiny_224.py new file mode 100644 index 0000000000000000000000000000000000000000..f1781cf5f84fe9dd8386b29337a9fe4f6d717784 --- /dev/null +++ b/configs/_base_/models/swin_transformer/tiny_224.py @@ -0,0 +1,23 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformer', arch='tiny', img_size=224, drop_path_rate=0.2), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/swin_transformer_v2/base_256.py b/configs/_base_/models/swin_transformer_v2/base_256.py new file mode 100644 index 0000000000000000000000000000000000000000..66594db25b17a20a346fcff944f2d37d8ff860f7 --- /dev/null +++ b/configs/_base_/models/swin_transformer_v2/base_256.py @@ -0,0 +1,26 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformerV2', + arch='base', + img_size=256, + drop_path_rate=0.5), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/swin_transformer_v2/base_384.py b/configs/_base_/models/swin_transformer_v2/base_384.py new file mode 100644 index 0000000000000000000000000000000000000000..5fb9aead2e98bba3f9277a02024981a1e22b6046 --- /dev/null +++ b/configs/_base_/models/swin_transformer_v2/base_384.py @@ -0,0 +1,17 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformerV2', + arch='base', + img_size=384, + drop_path_rate=0.2), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. 
+ loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False)) diff --git a/configs/_base_/models/swin_transformer_v2/large_256.py b/configs/_base_/models/swin_transformer_v2/large_256.py new file mode 100644 index 0000000000000000000000000000000000000000..fe557c32058be1563ed50696b9f44b95b3bb3bed --- /dev/null +++ b/configs/_base_/models/swin_transformer_v2/large_256.py @@ -0,0 +1,16 @@ +# model settings +# Only for evaluation +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformerV2', + arch='large', + img_size=256, + drop_path_rate=0.2), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5))) diff --git a/configs/_base_/models/swin_transformer_v2/large_384.py b/configs/_base_/models/swin_transformer_v2/large_384.py new file mode 100644 index 0000000000000000000000000000000000000000..a626c40715d1ea2cb1fb0cda0a249d1df01544dc --- /dev/null +++ b/configs/_base_/models/swin_transformer_v2/large_384.py @@ -0,0 +1,16 @@ +# model settings +# Only for evaluation +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformerV2', + arch='large', + img_size=384, + drop_path_rate=0.2), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1536, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5))) diff --git a/configs/_base_/models/swin_transformer_v2/small_256.py b/configs/_base_/models/swin_transformer_v2/small_256.py new file mode 100644 index 0000000000000000000000000000000000000000..0ec706ff0e16e44027fad3ee54e93280018d76bd --- /dev/null +++ b/configs/_base_/models/swin_transformer_v2/small_256.py @@ -0,0 +1,26 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformerV2', + arch='small', + img_size=256, + drop_path_rate=0.3), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/swin_transformer_v2/tiny_256.py b/configs/_base_/models/swin_transformer_v2/tiny_256.py new file mode 100644 index 0000000000000000000000000000000000000000..61055a1310ab86bea26d427fe445bc4cfe7bf89e --- /dev/null +++ b/configs/_base_/models/swin_transformer_v2/tiny_256.py @@ -0,0 +1,26 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformerV2', + arch='tiny', + img_size=256, + drop_path_rate=0.2), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) 
+ ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/t2t-vit-t-14.py b/configs/_base_/models/t2t-vit-t-14.py new file mode 100644 index 0000000000000000000000000000000000000000..58ea660e742b1ef8edf93fb10ac1331734a4dbe5 --- /dev/null +++ b/configs/_base_/models/t2t-vit-t-14.py @@ -0,0 +1,42 @@ +# model settings +embed_dims = 384 +num_classes = 1000 + +model = dict( + type='ImageClassifier', + backbone=dict( + type='T2T_ViT', + img_size=224, + in_channels=3, + embed_dims=embed_dims, + t2t_cfg=dict( + token_dims=64, + use_performer=False, + ), + num_layers=14, + layer_cfgs=dict( + num_heads=6, + feedforward_channels=3 * embed_dims, # mlp_ratio = 3 + ), + drop_path_rate=0.1, + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ]), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=num_classes, + in_channels=embed_dims, + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + mode='original', + ), + topk=(1, 5), + init_cfg=dict(type='TruncNormal', layer='Linear', std=.02)), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) diff --git a/configs/_base_/models/t2t-vit-t-19.py b/configs/_base_/models/t2t-vit-t-19.py new file mode 100644 index 0000000000000000000000000000000000000000..51741c7a7cbcfd8f13fb1574f831978a144ca1a4 --- /dev/null +++ b/configs/_base_/models/t2t-vit-t-19.py @@ -0,0 +1,42 @@ +# model settings +embed_dims = 448 +num_classes = 1000 + +model = dict( + type='ImageClassifier', + backbone=dict( + type='T2T_ViT', + img_size=224, + in_channels=3, + embed_dims=embed_dims, + t2t_cfg=dict( + token_dims=64, + use_performer=False, + ), + num_layers=19, + layer_cfgs=dict( + num_heads=7, + feedforward_channels=3 * embed_dims, # mlp_ratio = 3 + ), + drop_path_rate=0.1, + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ]), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=num_classes, + in_channels=embed_dims, + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + mode='original', + ), + topk=(1, 5), + init_cfg=dict(type='TruncNormal', layer='Linear', std=.02)), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) diff --git a/configs/_base_/models/t2t-vit-t-24.py b/configs/_base_/models/t2t-vit-t-24.py new file mode 100644 index 0000000000000000000000000000000000000000..ad772cf6e614bbca630ffad75393614415102bb9 --- /dev/null +++ b/configs/_base_/models/t2t-vit-t-24.py @@ -0,0 +1,42 @@ +# model settings +embed_dims = 512 +num_classes = 1000 + +model = dict( + type='ImageClassifier', + backbone=dict( + type='T2T_ViT', + img_size=224, + in_channels=3, + embed_dims=embed_dims, + t2t_cfg=dict( + token_dims=64, + use_performer=False, + ), + num_layers=24, + layer_cfgs=dict( + num_heads=8, + feedforward_channels=3 * embed_dims, # mlp_ratio = 3 + ), + drop_path_rate=0.1, + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ]), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=num_classes, + in_channels=embed_dims, + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + mode='original', + ), + topk=(1, 5), + init_cfg=dict(type='TruncNormal', layer='Linear', std=.02)), + 
train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) diff --git a/configs/_base_/models/tinyvit/tinyvit-11m.py b/configs/_base_/models/tinyvit/tinyvit-11m.py new file mode 100644 index 0000000000000000000000000000000000000000..6c046e35a0fe11aaa679300d3a2d3be59ff1051b --- /dev/null +++ b/configs/_base_/models/tinyvit/tinyvit-11m.py @@ -0,0 +1,25 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='TinyViT', + arch='11m', + img_size=(224, 224), + window_size=[7, 7, 14, 7], + out_indices=(3, ), + drop_path_rate=0.1, + gap_before_final_norm=True, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.), + ]), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=448, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/tinyvit/tinyvit-21m.py b/configs/_base_/models/tinyvit/tinyvit-21m.py new file mode 100644 index 0000000000000000000000000000000000000000..7f362f8f62789f6442e33a5a000ce8d9a458a597 --- /dev/null +++ b/configs/_base_/models/tinyvit/tinyvit-21m.py @@ -0,0 +1,25 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='TinyViT', + arch='21m', + img_size=(224, 224), + window_size=[7, 7, 14, 7], + out_indices=(3, ), + drop_path_rate=0.2, + gap_before_final_norm=True, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.), + ]), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=576, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/tinyvit/tinyvit-5m.py b/configs/_base_/models/tinyvit/tinyvit-5m.py new file mode 100644 index 0000000000000000000000000000000000000000..923ebd918f82f40537e0f40f550c3cd264d7e389 --- /dev/null +++ b/configs/_base_/models/tinyvit/tinyvit-5m.py @@ -0,0 +1,25 @@ +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='TinyViT', + arch='5m', + img_size=(224, 224), + window_size=[7, 7, 14, 7], + out_indices=(3, ), + drop_path_rate=0.0, + gap_before_final_norm=True, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.), + ]), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=320, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) diff --git a/configs/_base_/models/tnt_s_patch16_224.py b/configs/_base_/models/tnt_s_patch16_224.py new file mode 100644 index 0000000000000000000000000000000000000000..5e13d07828c5d89d0e9ce4fc1a29fe7a6a4875d4 --- /dev/null +++ b/configs/_base_/models/tnt_s_patch16_224.py @@ -0,0 +1,29 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='TNT', + arch='s', + img_size=224, + patch_size=16, + in_channels=3, + ffn_ratio=4, + qkv_bias=False, + drop_rate=0., + attn_drop_rate=0., + drop_path_rate=0.1, + first_stride=4, + num_fcs=2, + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) 
+ ]), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + topk=(1, 5), + init_cfg=dict(type='TruncNormal', layer='Linear', std=.02))) diff --git a/configs/_base_/models/twins_pcpvt_base.py b/configs/_base_/models/twins_pcpvt_base.py new file mode 100644 index 0000000000000000000000000000000000000000..14e46baedd273bd3baef163e2966653626170a1c --- /dev/null +++ b/configs/_base_/models/twins_pcpvt_base.py @@ -0,0 +1,31 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='PCPVT', + arch='base', + in_channels=3, + out_indices=(3, ), + qkv_bias=True, + norm_cfg=dict(type='LN', eps=1e-06), + norm_after_stage=[False, False, False, True], + drop_rate=0.0, + attn_drop_rate=0., + drop_path_rate=0.3), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/twins_svt_base.py b/configs/_base_/models/twins_svt_base.py new file mode 100644 index 0000000000000000000000000000000000000000..a37385b018f9b345ebcd3a9aaad575cd98e8b8f3 --- /dev/null +++ b/configs/_base_/models/twins_svt_base.py @@ -0,0 +1,31 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='SVT', + arch='base', + in_channels=3, + out_indices=(3, ), + qkv_bias=True, + norm_cfg=dict(type='LN'), + norm_after_stage=[False, False, False, True], + drop_rate=0.0, + attn_drop_rate=0., + drop_path_rate=0.3), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/van/van_base.py b/configs/_base_/models/van/van_base.py new file mode 100644 index 0000000000000000000000000000000000000000..006459255f82f4ad4250ee01f1d9d25605beb5d1 --- /dev/null +++ b/configs/_base_/models/van/van_base.py @@ -0,0 +1,13 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='VAN', arch='base', drop_path_rate=0.1), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. 
+ loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False)) diff --git a/configs/_base_/models/van/van_large.py b/configs/_base_/models/van/van_large.py new file mode 100644 index 0000000000000000000000000000000000000000..4ebafabdaaf7a4b828919e61e980e423385897e6 --- /dev/null +++ b/configs/_base_/models/van/van_large.py @@ -0,0 +1,13 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='VAN', arch='large', drop_path_rate=0.2), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False)) diff --git a/configs/_base_/models/van/van_small.py b/configs/_base_/models/van/van_small.py new file mode 100644 index 0000000000000000000000000000000000000000..29393c6308af0732f4757d1ef4bd98d7b3cddcf1 --- /dev/null +++ b/configs/_base_/models/van/van_small.py @@ -0,0 +1,22 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='VAN', arch='small', drop_path_rate=0.1), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/van/van_tiny.py b/configs/_base_/models/van/van_tiny.py new file mode 100644 index 0000000000000000000000000000000000000000..9cf5b28836f9216c642dfdfb62f37f3066a7ad09 --- /dev/null +++ b/configs/_base_/models/van/van_tiny.py @@ -0,0 +1,22 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='VAN', arch='tiny', drop_path_rate=0.1), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=256, + init_cfg=None, # suppress the default init_cfg of LinearClsHead. + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + cal_acc=False), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.) 
+ ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/vgg11.py b/configs/_base_/models/vgg11.py new file mode 100644 index 0000000000000000000000000000000000000000..2b6ee1426aae383b1db5c4451e37caec5eafdcfa --- /dev/null +++ b/configs/_base_/models/vgg11.py @@ -0,0 +1,10 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='VGG', depth=11, num_classes=1000), + neck=None, + head=dict( + type='ClsHead', + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/vgg11bn.py b/configs/_base_/models/vgg11bn.py new file mode 100644 index 0000000000000000000000000000000000000000..cb4c64e95a85367841615fd52af7af50b5b1e9fb --- /dev/null +++ b/configs/_base_/models/vgg11bn.py @@ -0,0 +1,11 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VGG', depth=11, norm_cfg=dict(type='BN'), num_classes=1000), + neck=None, + head=dict( + type='ClsHead', + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/vgg13.py b/configs/_base_/models/vgg13.py new file mode 100644 index 0000000000000000000000000000000000000000..a9389100a61514043bbe7426b93cfd257df5cd26 --- /dev/null +++ b/configs/_base_/models/vgg13.py @@ -0,0 +1,10 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='VGG', depth=13, num_classes=1000), + neck=None, + head=dict( + type='ClsHead', + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/vgg13bn.py b/configs/_base_/models/vgg13bn.py new file mode 100644 index 0000000000000000000000000000000000000000..b12173b51b80b671fd85c9fa8ececd75881d4bd2 --- /dev/null +++ b/configs/_base_/models/vgg13bn.py @@ -0,0 +1,11 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VGG', depth=13, norm_cfg=dict(type='BN'), num_classes=1000), + neck=None, + head=dict( + type='ClsHead', + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/vgg16.py b/configs/_base_/models/vgg16.py new file mode 100644 index 0000000000000000000000000000000000000000..93ce864fac29a7c4adf4df12e5653f97ce09d7be --- /dev/null +++ b/configs/_base_/models/vgg16.py @@ -0,0 +1,10 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='VGG', depth=16, num_classes=1000), + neck=None, + head=dict( + type='ClsHead', + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/vgg16bn.py b/configs/_base_/models/vgg16bn.py new file mode 100644 index 0000000000000000000000000000000000000000..765e34f6367bc52e10322692a849d1003d57dfd2 --- /dev/null +++ b/configs/_base_/models/vgg16bn.py @@ -0,0 +1,11 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VGG', depth=16, norm_cfg=dict(type='BN'), num_classes=1000), + neck=None, + head=dict( + type='ClsHead', + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/vgg19.py b/configs/_base_/models/vgg19.py new file mode 100644 index 0000000000000000000000000000000000000000..6f4ab061b2c7a87d86aaebcf78aaf84abd2bb0cc --- /dev/null +++ b/configs/_base_/models/vgg19.py @@ -0,0 +1,10 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='VGG', depth=19, num_classes=1000), + neck=None, + head=dict( + 
type='ClsHead', + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/vgg19bn.py b/configs/_base_/models/vgg19bn.py new file mode 100644 index 0000000000000000000000000000000000000000..c468b5dea2cc5503ca2b266c57d163b2308b7dd3 --- /dev/null +++ b/configs/_base_/models/vgg19bn.py @@ -0,0 +1,11 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VGG', depth=19, norm_cfg=dict(type='BN'), num_classes=1000), + neck=None, + head=dict( + type='ClsHead', + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/vig/pyramid_vig_base.py b/configs/_base_/models/vig/pyramid_vig_base.py new file mode 100644 index 0000000000000000000000000000000000000000..a258457c84aecc2f1cdf29131f60b522526dbdd8 --- /dev/null +++ b/configs/_base_/models/vig/pyramid_vig_base.py @@ -0,0 +1,32 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='PyramidVig', + arch='base', + k=9, + act_cfg=dict(type='GELU'), + norm_cfg=dict(type='BN'), + graph_conv_type='mr', + graph_conv_bias=True, + epsilon=0.2, + use_stochastic=False, + drop_path=0.1, + norm_eval=False, + frozen_stages=0), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='VigClsHead', + num_classes=1000, + in_channels=1024, + hidden_dim=1024, + act_cfg=dict(type='GELU'), + dropout=0., + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/vig/pyramid_vig_medium.py b/configs/_base_/models/vig/pyramid_vig_medium.py new file mode 100644 index 0000000000000000000000000000000000000000..a551aba3e079576e13f5db3a77d5e6622079e497 --- /dev/null +++ b/configs/_base_/models/vig/pyramid_vig_medium.py @@ -0,0 +1,32 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='PyramidVig', + arch='medium', + k=9, + act_cfg=dict(type='GELU'), + norm_cfg=dict(type='BN'), + graph_conv_type='mr', + graph_conv_bias=True, + epsilon=0.2, + use_stochastic=False, + drop_path=0.1, + norm_eval=False, + frozen_stages=0), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='VigClsHead', + num_classes=1000, + in_channels=768, + hidden_dim=1024, + act_cfg=dict(type='GELU'), + dropout=0., + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/vig/pyramid_vig_small.py b/configs/_base_/models/vig/pyramid_vig_small.py new file mode 100644 index 0000000000000000000000000000000000000000..940275e6cf941ce0d6a7f7dc3e4a1b867cf88309 --- /dev/null +++ b/configs/_base_/models/vig/pyramid_vig_small.py @@ -0,0 +1,32 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='PyramidVig', + arch='small', + k=9, + act_cfg=dict(type='GELU'), + norm_cfg=dict(type='BN'), + graph_conv_type='mr', + graph_conv_bias=True, + epsilon=0.2, + use_stochastic=False, + drop_path=0.1, + norm_eval=False, + frozen_stages=0), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='VigClsHead', + num_classes=1000, + in_channels=640, + hidden_dim=1024, + act_cfg=dict(type='GELU'), + dropout=0., + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) 
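The files under `configs/_base_/models/` in this change are `_base_` fragments rather than standalone training configs: a complete config composes one model fragment with dataset, schedule, and runtime bases through `_base_` inheritance and then overrides individual keys. The sketch below shows how the ViG fragment added above might be consumed; the child file name, the dataset/runtime paths, and the override value are illustrative assumptions, not part of this change.

```python
# Hypothetical child config, e.g. configs/vig/pyramid-vig-small_8xb128_in1k.py
_base_ = [
    '../_base_/models/vig/pyramid_vig_small.py',         # model fragment added above
    '../_base_/datasets/imagenet_bs64_swin_224.py',       # dataset fragment (assumed path)
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',  # schedule fragment added later in this change
    '../_base_/default_runtime.py',                       # runtime defaults (assumed path)
]

# Keys set here are merged into the inherited dicts, so only the changed
# field needs to be restated, e.g. re-targeting the head to 100 classes.
model = dict(head=dict(num_classes=100))
```

Because the merge is key-by-key, everything else (backbone, neck, loss, training augments) is kept exactly as defined in the base fragments.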
diff --git a/configs/_base_/models/vig/pyramid_vig_tiny.py b/configs/_base_/models/vig/pyramid_vig_tiny.py new file mode 100644 index 0000000000000000000000000000000000000000..fea0734fe9ab2e962e51b819c467ad965b88a958 --- /dev/null +++ b/configs/_base_/models/vig/pyramid_vig_tiny.py @@ -0,0 +1,32 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='PyramidVig', + arch='tiny', + k=9, + act_cfg=dict(type='GELU'), + norm_cfg=dict(type='BN'), + graph_conv_type='mr', + graph_conv_bias=True, + epsilon=0.2, + use_stochastic=False, + drop_path=0.1, + norm_eval=False, + frozen_stages=0), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='VigClsHead', + num_classes=1000, + in_channels=384, + hidden_dim=1024, + act_cfg=dict(type='GELU'), + dropout=0., + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/vig/vig_base.py b/configs/_base_/models/vig/vig_base.py new file mode 100644 index 0000000000000000000000000000000000000000..6c5f293ddfab1e8712c90f96aaa62acf62159e65 --- /dev/null +++ b/configs/_base_/models/vig/vig_base.py @@ -0,0 +1,33 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='Vig', + arch='base', + k=9, + act_cfg=dict(type='GELU'), + norm_cfg=dict(type='BN'), + graph_conv_type='mr', + graph_conv_bias=True, + epsilon=0.2, + use_dilation=True, + use_stochastic=False, + drop_path=0.1, + relative_pos=False, + norm_eval=False, + frozen_stages=0), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='VigClsHead', + num_classes=1000, + in_channels=640, + hidden_dim=1024, + act_cfg=dict(type='GELU'), + dropout=0., + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/vig/vig_small.py b/configs/_base_/models/vig/vig_small.py new file mode 100644 index 0000000000000000000000000000000000000000..93587ffba628d8900b17a537eed1406c7af57e9a --- /dev/null +++ b/configs/_base_/models/vig/vig_small.py @@ -0,0 +1,33 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='Vig', + arch='small', + k=9, + act_cfg=dict(type='GELU'), + norm_cfg=dict(type='BN'), + graph_conv_type='mr', + graph_conv_bias=True, + epsilon=0.2, + use_dilation=True, + use_stochastic=False, + drop_path=0.1, + relative_pos=False, + norm_eval=False, + frozen_stages=0), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='VigClsHead', + num_classes=1000, + in_channels=320, + hidden_dim=1024, + act_cfg=dict(type='GELU'), + dropout=0., + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/vig/vig_tiny.py b/configs/_base_/models/vig/vig_tiny.py new file mode 100644 index 0000000000000000000000000000000000000000..c50bac222a88a665a1b7adc8398f805ff10be7f1 --- /dev/null +++ b/configs/_base_/models/vig/vig_tiny.py @@ -0,0 +1,33 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='Vig', + arch='tiny', + k=9, + act_cfg=dict(type='GELU'), + norm_cfg=dict(type='BN'), + graph_conv_type='mr', + graph_conv_bias=True, + epsilon=0.2, + use_dilation=True, + use_stochastic=False, + drop_path=0.1, + relative_pos=False, + norm_eval=False, + frozen_stages=0), + 
neck=dict(type='GlobalAveragePooling'), + head=dict( + type='VigClsHead', + num_classes=1000, + in_channels=192, + hidden_dim=1024, + act_cfg=dict(type='GELU'), + dropout=0., + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) diff --git a/configs/_base_/models/vit-base-p16.py b/configs/_base_/models/vit-base-p16.py new file mode 100644 index 0000000000000000000000000000000000000000..bb42bed5fa5ecedf9aa94c82ee63462181df0605 --- /dev/null +++ b/configs/_base_/models/vit-base-p16.py @@ -0,0 +1,25 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='b', + img_size=224, + patch_size=16, + drop_rate=0.1, + init_cfg=[ + dict( + type='Kaiming', + layer='Conv2d', + mode='fan_in', + nonlinearity='linear') + ]), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, + mode='classy_vision'), + )) diff --git a/configs/_base_/models/vit-base-p32.py b/configs/_base_/models/vit-base-p32.py new file mode 100644 index 0000000000000000000000000000000000000000..ad550ef9b9bdbb218e6743ccf37e7929e5758865 --- /dev/null +++ b/configs/_base_/models/vit-base-p32.py @@ -0,0 +1,24 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='b', + img_size=224, + patch_size=32, + drop_rate=0.1, + init_cfg=[ + dict( + type='Kaiming', + layer='Conv2d', + mode='fan_in', + nonlinearity='linear') + ]), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/vit-large-p16.py b/configs/_base_/models/vit-large-p16.py new file mode 100644 index 0000000000000000000000000000000000000000..97162304563827716366d20bd29a11fed542be62 --- /dev/null +++ b/configs/_base_/models/vit-large-p16.py @@ -0,0 +1,24 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='l', + img_size=224, + patch_size=16, + drop_rate=0.1, + init_cfg=[ + dict( + type='Kaiming', + layer='Conv2d', + mode='fan_in', + nonlinearity='linear') + ]), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=1024, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/vit-large-p32.py b/configs/_base_/models/vit-large-p32.py new file mode 100644 index 0000000000000000000000000000000000000000..f9491bb561433ff01f60a8aa7a4993c28c8b9b02 --- /dev/null +++ b/configs/_base_/models/vit-large-p32.py @@ -0,0 +1,24 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='l', + img_size=224, + patch_size=32, + drop_rate=0.1, + init_cfg=[ + dict( + type='Kaiming', + layer='Conv2d', + mode='fan_in', + nonlinearity='linear') + ]), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=1024, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/models/wide-resnet50.py b/configs/_base_/models/wide-resnet50.py new file mode 100644 index 0000000000000000000000000000000000000000..a2913b9aa6afb10c36199530441ab39348650bc7 --- /dev/null +++ b/configs/_base_/models/wide-resnet50.py @@ -0,0 +1,20 @@ +# model 
settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet', + depth=50, + num_stages=4, + out_indices=(3, ), + stem_channels=64, + base_channels=128, + expansion=2, + style='pytorch'), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + topk=(1, 5), + )) diff --git a/configs/_base_/schedules/cifar10_bs128.py b/configs/_base_/schedules/cifar10_bs128.py new file mode 100644 index 0000000000000000000000000000000000000000..fadb6c1285515b0d0ee7c2c17c3a9d19f4a63713 --- /dev/null +++ b/configs/_base_/schedules/cifar10_bs128.py @@ -0,0 +1,15 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001)) +# learning policy +param_scheduler = dict( + type='MultiStepLR', by_epoch=True, milestones=[100, 150], gamma=0.1) + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=200, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=128) diff --git a/configs/_base_/schedules/cub_bs64.py b/configs/_base_/schedules/cub_bs64.py new file mode 100644 index 0000000000000000000000000000000000000000..1d0b4be7bd7b7043636fb2356b76512281a37e2b --- /dev/null +++ b/configs/_base_/schedules/cub_bs64.py @@ -0,0 +1,34 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict( + type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005, nesterov=True)) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=0.01, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=95, + by_epoch=True, + begin=5, + end=100, + ) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=64) diff --git a/configs/_base_/schedules/imagenet_bs1024_adamw_conformer.py b/configs/_base_/schedules/imagenet_bs1024_adamw_conformer.py new file mode 100644 index 0000000000000000000000000000000000000000..2285d0ea6c70de222a76d6b7404fc16e5fd28e0e --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs1024_adamw_conformer.py @@ -0,0 +1,43 @@ +optim_wrapper = dict( + optimizer=dict( + type='AdamW', + # for batch in each gpu is 128, 8 gpu + # lr = 5e-4 * 128 * 8 / 512 = 0.001 + lr=5e-4 * 128 * 8 / 512, + weight_decay=0.05, + eps=1e-8, + betas=(0.9, 0.999)), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.cls_token': dict(decay_mult=0.0), + }), +) + +# learning policy +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=295, + eta_min=1e-5, + by_epoch=True, + begin=5, + end=300) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. 
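+# Illustrative example (not part of the original file): if a run actually uses a
+# total batch size of 512 (e.g. 4 GPUs x 128 images) and automatic LR scaling is
+# enabled at launch, the lr configured above is expected to be multiplied by
+# 512 / 1024 = 0.5.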
+auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/_base_/schedules/imagenet_bs1024_adamw_hivit.py b/configs/_base_/schedules/imagenet_bs1024_adamw_hivit.py new file mode 100644 index 0000000000000000000000000000000000000000..5b2df97b813d1c3922dd470d2f0743eca44221ee --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs1024_adamw_hivit.py @@ -0,0 +1,41 @@ +# for batch in each gpu is 128, 8 gpu +# lr = 5e-4 * 128 * 8 / 512 = 0.001 +optim_wrapper = dict( + optimizer=dict( + type='AdamW', + lr=5e-4 * 1024 / 512, + weight_decay=0.05, + eps=1e-8, + betas=(0.9, 0.999)), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + flat_decay_mult=0.0, + custom_keys={ + '.pos_embed': dict(decay_mult=0.0), + '.relative_position_bias_table': dict(decay_mult=0.0) + }), +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=20, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/_base_/schedules/imagenet_bs1024_adamw_revvit.py b/configs/_base_/schedules/imagenet_bs1024_adamw_revvit.py new file mode 100644 index 0000000000000000000000000000000000000000..87fd202ce4076a69cae63f0d9d3f6b860639ff49 --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs1024_adamw_revvit.py @@ -0,0 +1,41 @@ +# for batch in each gpu is 128, 8 gpu +# lr = 5e-4 * 128 * 8 / 512 = 0.001 +# schedule settings +optim_wrapper = dict( + optimizer=dict( + type='AdamW', + lr=5e-4 * 2048 / 512, + weight_decay=0.05, + eps=1e-8, + betas=(0.9, 0.999)), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + }), + clip_grad=dict(max_norm=1.0), +) +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-8 / 2e-3, + by_epoch=True, + end=70, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/_base_/schedules/imagenet_bs1024_adamw_swin.py b/configs/_base_/schedules/imagenet_bs1024_adamw_swin.py new file mode 100644 index 0000000000000000000000000000000000000000..fd06cc115a7ab4cbaa7ef7fa1d9366bdd5db878f --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs1024_adamw_swin.py @@ -0,0 +1,41 @@ +# for batch in each gpu is 128, 8 gpu +# lr = 5e-4 * 128 * 8 / 512 = 0.001 +optim_wrapper = dict( + optimizer=dict( + type='AdamW', + lr=5e-4 * 1024 / 512, + weight_decay=0.05, + eps=1e-8, + betas=(0.9, 0.999)), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + flat_decay_mult=0.0, + custom_keys={ + '.absolute_pos_embed': dict(decay_mult=0.0), + '.relative_position_bias_table': dict(decay_mult=0.0) + }), +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=20, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/_base_/schedules/imagenet_bs1024_coslr.py b/configs/_base_/schedules/imagenet_bs1024_coslr.py new file mode 100644 index 0000000000000000000000000000000000000000..285884d0b2b132329bab682f4418d891d7378ec1 --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs1024_coslr.py @@ -0,0 +1,18 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=5e-5)) + +# learning policy +param_scheduler = [ + dict(type='LinearLR', start_factor=0.1, by_epoch=True, begin=0, end=5), + dict(type='CosineAnnealingLR', T_max=95, by_epoch=True, begin=5, end=100) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py b/configs/_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py new file mode 100644 index 0000000000000000000000000000000000000000..cf38d4731c867ac381ff0420b0063f8a7e7dfe2e --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py @@ -0,0 +1,20 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.5, momentum=0.9, weight_decay=0.00004), + paramwise_cfg=dict(norm_decay_mult=0), +) + +# learning policy +param_scheduler = [ + dict(type='ConstantLR', factor=0.1, by_epoch=False, begin=0, end=5000), + dict(type='PolyLR', eta_min=0, by_epoch=False, begin=5000) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/_base_/schedules/imagenet_bs2048.py b/configs/_base_/schedules/imagenet_bs2048.py new file mode 100644 index 0000000000000000000000000000000000000000..1cfbfbe6752d923c248b92f3c7b7ace817bad9a4 --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs2048.py @@ -0,0 +1,21 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict( + type='SGD', lr=0.8, momentum=0.9, weight_decay=0.0001, nesterov=True)) + +# learning policy +param_scheduler = [ + dict( + type='LinearLR', start_factor=0.25, by_epoch=False, begin=0, end=2500), + dict( + type='MultiStepLR', by_epoch=True, milestones=[30, 60, 90], gamma=0.1) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/_base_/schedules/imagenet_bs2048_AdamW.py b/configs/_base_/schedules/imagenet_bs2048_AdamW.py new file mode 100644 index 0000000000000000000000000000000000000000..bbfae8ef222b10663e1313000d05290d729ca5c8 --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs2048_AdamW.py @@ -0,0 +1,39 @@ +# optimizer +# In ClassyVision, the lr is set to 0.003 for bs4096. +# In this implementation(bs2048), lr = 0.003 / 4096 * (32bs * 64gpus) = 0.0015 +optim_wrapper = dict( + optimizer=dict(type='AdamW', lr=0.0015, weight_decay=0.3), + # specific to vit pretrain + paramwise_cfg=dict(custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + }), +) + +# learning policy +warmup_epochs = 15 # about 10000 iterations for ImageNet-1k +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=warmup_epochs, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + eta_min=1e-5, + by_epoch=True, + begin=warmup_epochs) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/_base_/schedules/imagenet_bs2048_adamw_levit.py b/configs/_base_/schedules/imagenet_bs2048_adamw_levit.py new file mode 100644 index 0000000000000000000000000000000000000000..25a536eaac52f1c42b37e0d0b102da252deebd67 --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs2048_adamw_levit.py @@ -0,0 +1,40 @@ +# for batch in each gpu is 256, 8 gpu +# lr = 5e-4 * 256 * 8 / 512 = 0.002 +optim_wrapper = dict( + optimizer=dict( + type='AdamW', + lr=0.002, + weight_decay=0.025, + eps=1e-8, + betas=(0.9, 0.999)), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.attention_biases': dict(decay_mult=0.0), + }), +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-6 / 0.002, + by_epoch=True, + end=5, + # update by iter + convert_to_iter_based=True, + ), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=5) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=1000) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/_base_/schedules/imagenet_bs2048_coslr.py b/configs/_base_/schedules/imagenet_bs2048_coslr.py new file mode 100644 index 0000000000000000000000000000000000000000..b8551f55c8082ba07c084324c2bf1fbb9f26ea56 --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs2048_coslr.py @@ -0,0 +1,35 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict( + type='SGD', lr=0.8, momentum=0.9, weight_decay=0.0001, nesterov=True)) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=0.25, + by_epoch=True, + begin=0, + # about 2500 iterations for ImageNet-1k + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=95, + by_epoch=True, + begin=5, + end=100, + ) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/_base_/schedules/imagenet_bs2048_rsb.py b/configs/_base_/schedules/imagenet_bs2048_rsb.py new file mode 100644 index 0000000000000000000000000000000000000000..f0d2d7994293afdc43b906c918d486397dc53206 --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs2048_rsb.py @@ -0,0 +1,32 @@ +# optimizer +optim_wrapper = dict(optimizer=dict(type='Lamb', lr=0.005, weight_decay=0.02)) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=95, + eta_min=1.0e-6, + by_epoch=True, + begin=5, + end=100) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/_base_/schedules/imagenet_bs256.py b/configs/_base_/schedules/imagenet_bs256.py new file mode 100644 index 0000000000000000000000000000000000000000..3f92273d1b831ae5cd6663cfe65b1f0d8f01e630 --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs256.py @@ -0,0 +1,16 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001)) + +# learning policy +param_scheduler = dict( + type='MultiStepLR', by_epoch=True, milestones=[30, 60, 90], gamma=0.1) + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=256) diff --git a/configs/_base_/schedules/imagenet_bs256_140e.py b/configs/_base_/schedules/imagenet_bs256_140e.py new file mode 100644 index 0000000000000000000000000000000000000000..e65bf522d9739073baf38db7f10e6b27d7cd4f31 --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs256_140e.py @@ -0,0 +1,16 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001)) + +# learning policy +param_scheduler = dict( + type='MultiStepLR', by_epoch=True, milestones=[40, 80, 120], gamma=0.1) + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=140, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=256) diff --git a/configs/_base_/schedules/imagenet_bs256_200e_coslr_warmup.py b/configs/_base_/schedules/imagenet_bs256_200e_coslr_warmup.py new file mode 100644 index 0000000000000000000000000000000000000000..c8d94a7606aead6d4142bf8a61228eb6475eb5c6 --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs256_200e_coslr_warmup.py @@ -0,0 +1,34 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001)) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=0.25, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True, + ), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=195, + by_epoch=True, + begin=5, + end=200, + ) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=200, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=256) diff --git a/configs/_base_/schedules/imagenet_bs256_coslr.py b/configs/_base_/schedules/imagenet_bs256_coslr.py new file mode 100644 index 0000000000000000000000000000000000000000..44e2c8bb5d0800568bb3c7079b9e0c3e1322711c --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs256_coslr.py @@ -0,0 +1,16 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001)) + +# learning policy +param_scheduler = dict( + type='CosineAnnealingLR', T_max=100, by_epoch=True, begin=0, end=100) + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=256) diff --git a/configs/_base_/schedules/imagenet_bs256_coslr_coswd_300e.py b/configs/_base_/schedules/imagenet_bs256_coslr_coswd_300e.py new file mode 100644 index 0000000000000000000000000000000000000000..318e031574367aa9d34ec28453deccc60377372f --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs256_coslr_coswd_300e.py @@ -0,0 +1,40 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001)) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=0.001, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=295, + eta_min=1.0e-6, + by_epoch=True, + begin=5, + end=300), + dict( + type='CosineAnnealingParamScheduler', + param_name='weight_decay', + eta_min=0.00001, + by_epoch=True, + begin=0, + end=300) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=256) diff --git a/configs/_base_/schedules/imagenet_bs256_epochstep.py b/configs/_base_/schedules/imagenet_bs256_epochstep.py new file mode 100644 index 0000000000000000000000000000000000000000..b8c2b905bf362022d07d452df76c10cccfb6565e --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs256_epochstep.py @@ -0,0 +1,15 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.045, momentum=0.9, weight_decay=0.00004)) + +# learning policy +param_scheduler = dict(type='StepLR', by_epoch=True, step_size=1, gamma=0.98) + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=256) diff --git a/configs/_base_/schedules/imagenet_bs4096_AdamW.py b/configs/_base_/schedules/imagenet_bs4096_AdamW.py new file mode 100644 index 0000000000000000000000000000000000000000..84b1f39beaef86b412c159a54d74c4f09458dc57 --- /dev/null +++ b/configs/_base_/schedules/imagenet_bs4096_AdamW.py @@ -0,0 +1,39 @@ +# optimizer +optim_wrapper = dict( + optimizer=dict(type='AdamW', lr=0.003, weight_decay=0.3), + # specific to vit pretrain + paramwise_cfg=dict(custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + }), +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=30, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=270, + by_epoch=True, + begin=30, + end=300, + ) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/_base_/schedules/imagenet_lars_coslr_200e.py b/configs/_base_/schedules/imagenet_lars_coslr_200e.py new file mode 100644 index 0000000000000000000000000000000000000000..baba55c4f43b60620a646c812b24e6ffcbd7860a --- /dev/null +++ b/configs/_base_/schedules/imagenet_lars_coslr_200e.py @@ -0,0 +1,20 @@ +# optimizer wrapper +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='LARS', lr=4.8, weight_decay=1e-6, momentum=0.9)) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', T_max=190, by_epoch=True, begin=10, end=200) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=200) diff --git a/configs/_base_/schedules/imagenet_lars_coslr_90e.py b/configs/_base_/schedules/imagenet_lars_coslr_90e.py new file mode 100644 index 0000000000000000000000000000000000000000..6e7875a36e76eccefbf752d704fcb12beb6c6506 --- /dev/null +++ b/configs/_base_/schedules/imagenet_lars_coslr_90e.py @@ -0,0 +1,14 @@ +# optimizer wrapper +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=0.)) + +# learning rate scheduler +param_scheduler = [ + dict(type='CosineAnnealingLR', T_max=90, by_epoch=True, begin=0, end=90) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=90) +val_cfg = dict() +test_cfg = dict() diff --git a/configs/_base_/schedules/imagenet_sgd_coslr_100e.py b/configs/_base_/schedules/imagenet_sgd_coslr_100e.py new file mode 100644 index 0000000000000000000000000000000000000000..08e9a3e71fc0d8c186b8fdeb5bb59fd3a1d5148e --- /dev/null +++ b/configs/_base_/schedules/imagenet_sgd_coslr_100e.py @@ -0,0 +1,14 @@ +# optimizer wrapper +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='SGD', lr=0.3, momentum=0.9, weight_decay=1e-6)) + +# learning rate scheduler +param_scheduler = [ + dict(type='CosineAnnealingLR', T_max=100, by_epoch=True, begin=0, end=100) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100) +val_cfg = dict() +test_cfg = dict() diff --git a/configs/_base_/schedules/imagenet_sgd_coslr_200e.py b/configs/_base_/schedules/imagenet_sgd_coslr_200e.py new file mode 100644 index 0000000000000000000000000000000000000000..f38e4983038031c9178813297dc744195e855680 --- /dev/null +++ b/configs/_base_/schedules/imagenet_sgd_coslr_200e.py @@ -0,0 +1,12 @@ +# optimizer wrapper +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='SGD', lr=0.03, weight_decay=1e-4, momentum=0.9)) + +# learning rate scheduler +param_scheduler = [ + dict(type='CosineAnnealingLR', T_max=200, by_epoch=True, begin=0, end=200) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=200) diff --git a/configs/_base_/schedules/imagenet_sgd_steplr_100e.py b/configs/_base_/schedules/imagenet_sgd_steplr_100e.py new file mode 100644 index 0000000000000000000000000000000000000000..75b725c7dfb074c3ebe5c7536752eb32c45b89cc --- /dev/null +++ b/configs/_base_/schedules/imagenet_sgd_steplr_100e.py @@ -0,0 +1,14 @@ +# optimizer wrapper +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=1e-4)) + +# learning rate scheduler +param_scheduler = [ + dict(type='MultiStepLR', by_epoch=True, milestones=[60, 80], gamma=0.1) +] + +# runtime settings +train_cfg = 
dict(type='EpochBasedTrainLoop', max_epochs=100) +val_cfg = dict() +test_cfg = dict() diff --git a/configs/arcface/README.md b/configs/arcface/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6b2ee6a3e6164531da343e954c9f5a20917f052d --- /dev/null +++ b/configs/arcface/README.md @@ -0,0 +1,80 @@ +# ArcFace + +> [ArcFace: Additive Angular Margin Loss for Deep Face Recognition](https://arxiv.org/abs/1801.07698) + + + +## Abstract + +Recently, a popular line of research in face recognition is adopting margins in the well-established softmax loss function to maximize class separability. In this paper, we first introduce an Additive Angular Margin Loss (ArcFace), which not only has a clear geometric interpretation but also significantly enhances the discriminative power. Since ArcFace is susceptible to the massive label noise, we further propose sub-center ArcFace, in which each class contains K sub-centers and training samples only need to be close to any of the K positive sub-centers. Sub-center ArcFace encourages one dominant sub-class that contains the majority of clean faces and non-dominant sub-classes that include hard or noisy faces. Based on this self-propelled isolation, we boost the performance through automatically purifying raw web faces under massive real-world noise. Besides discriminative feature embedding, we also explore the inverse problem, mapping feature vectors to face images. Without training any additional generator or discriminator, the pre-trained ArcFace model can generate identity-preserved face images for both subjects inside and outside the training data only by using the network gradient and Batch Normalization (BN) priors. Extensive experiments demonstrate that ArcFace can enhance the discriminative feature embedding as well as strengthen the generative face synthesis. + +
+ +
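+The additive angular margin above acts directly on the classification logits: the margin `m` is added to the angle between an L2-normalized embedding and its ground-truth class weight, and the corrected cosine is rescaled by `s` before the softmax cross-entropy. A minimal sketch of that computation is shown below; the scale `s=64.0` and margin `m=0.5` are illustrative values only, not taken from the `ArcFaceClsHead` config in this folder.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def arcface_logits(features, weight, labels, s=64.0, m=0.5):
+    """Sketch of the additive angular margin: use cos(theta + m) for the target class."""
+    # cosine similarity between L2-normalized embeddings and class weights
+    cosine = F.linear(F.normalize(features), F.normalize(weight))
+    theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
+    target = F.one_hot(labels, num_classes=weight.size(0)).bool()
+    # add the margin only to the ground-truth class angle, then rescale
+    return torch.where(target, torch.cos(theta + m), cosine) * s  # feed to cross-entropy
+```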
+ +## How to use it? + + + +**Retrieve image** + +```python +from mmpretrain import ImageRetrievalInferencer + +inferencer = ImageRetrievalInferencer('resnet50-arcface_inshop', prototype='demo/') +predict = inferencer('demo/dog.jpg', topk=2)[0] +print(predict[0]) +print(predict[1]) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('resnet50-arcface_inshop', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/arcface/resnet50-arcface_8xb32_inshop.py +``` + +Test: + +```shell +python tools/test.py configs/arcface/resnet50-arcface_8xb32_inshop.py https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth +``` + + + +## Models and results + +### Image Retrieval on InShop + +| Model | Pretrain | Params(M) | Flops(G) | Recall@1 | mAP@10 | Config | Download | +| :-----------------------: | :------------------------------------------------: | :-------: | :------: | :------: | :----: | :------------------------------------------: | :------------------------------------------------: | +| `resnet50-arcface_inshop` | [ImageNet-21k-mill](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth) | 31.69 | 16.48 | 90.18 | 69.30 | [config](./resnet50-arcface_8xb32_inshop.py) | [model](https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.log) | + +## Citation + +```bibtex +@inproceedings{deng2018arcface, +title={ArcFace: Additive Angular Margin Loss for Deep Face Recognition}, +author={Deng, Jiankang and Guo, Jia and Niannan, Xue and Zafeiriou, Stefanos}, +booktitle={CVPR}, +year={2019} +} +``` diff --git a/configs/arcface/metafile.yml b/configs/arcface/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..050aba5b3e1c2980234aef13767106ed237eee12 --- /dev/null +++ b/configs/arcface/metafile.yml @@ -0,0 +1,28 @@ +Collections: + - Name: ArcFace + Metadata: + Training Data: InShop + Architecture: + - Additive Angular Margin Loss + Paper: + URL: https://arxiv.org/abs/1801.07698 + Title: 'ArcFace: Additive Angular Margin Loss for Deep Face Recognition' + README: configs/arcface/README.md + Code: + Version: v1.0.0rc3 + URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc3/mmcls/models/heads/margin_head.py + +Models: + - Name: resnet50-arcface_inshop + Metadata: + FLOPs: 16571226112 + Parameters: 31693888 + In Collection: ArcFace + Results: + - Dataset: InShop + Metrics: + Recall@1: 90.18 + mAP@10: 69.30 + Task: Image Retrieval + Weights: https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth + Config: configs/arcface/resnet50-arcface_8xb32_inshop.py diff --git a/configs/arcface/resnet50-arcface_8xb32_inshop.py b/configs/arcface/resnet50-arcface_8xb32_inshop.py new file mode 100644 index 0000000000000000000000000000000000000000..cc351e7870415a687679a1970bba0c24ebc02884 --- /dev/null +++ b/configs/arcface/resnet50-arcface_8xb32_inshop.py @@ -0,0 +1,71 @@ +_base_ = [ + 
'../_base_/datasets/inshop_bs32_448.py', + '../_base_/schedules/cub_bs64.py', + '../_base_/default_runtime.py', +] + +pretrained = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth' # noqa +model = dict( + type='ImageToImageRetriever', + image_encoder=[ + dict( + type='ResNet', + depth=50, + init_cfg=dict( + type='Pretrained', checkpoint=pretrained, prefix='backbone')), + dict(type='GlobalAveragePooling'), + ], + head=dict( + type='ArcFaceClsHead', + num_classes=3997, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + init_cfg=None), + prototype={{_base_.gallery_dataloader}}) + +# runtime settings +default_hooks = dict( + # log every 20 intervals + logger=dict(type='LoggerHook', interval=20), + # save last three checkpoints + checkpoint=dict( + type='CheckpointHook', + save_best='auto', + interval=1, + max_keep_ckpts=3, + rule='greater')) + +# optimizer +optim_wrapper = dict( + optimizer=dict( + type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0005, nesterov=True)) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=0.01, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=45, + by_epoch=True, + begin=5, + end=50, + ) +] + +train_cfg = dict(by_epoch=True, max_epochs=50, val_interval=1) + +auto_scale_lr = dict(enable=True, base_batch_size=256) + +custom_hooks = [ + dict(type='PrepareProtoBeforeValLoopHook'), + dict(type='SyncBuffersHook') +] diff --git a/configs/barlowtwins/README.md b/configs/barlowtwins/README.md new file mode 100644 index 0000000000000000000000000000000000000000..515d138856b170378ecfeb213aff6c582442f335 --- /dev/null +++ b/configs/barlowtwins/README.md @@ -0,0 +1,85 @@ +# BarlowTwins + +> [Barlow Twins: Self-Supervised Learning via Redundancy Reduction](https://arxiv.org/abs/2103.03230) + + + +## Abstract + +Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection. + +
+ +
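+The objective summarized above fits in a few lines: standardize the embeddings of the two distorted views over the batch, form their cross-correlation matrix, pull its diagonal towards 1 and push its off-diagonal entries towards 0. The sketch below only illustrates the idea; the trade-off weight `lambd=5e-3` is an assumed value, and the actual training configs in this folder compute the loss through `LatentCrossCorrelationHead` with `CrossCorrelationLoss`.
+
+```python
+import torch
+
+
+def barlow_twins_loss(z_a, z_b, lambd=5e-3):
+    """Sketch of the redundancy-reduction loss on two views' embeddings of shape (N, D)."""
+    n = z_a.size(0)
+    # standardize every embedding dimension over the batch
+    z_a = (z_a - z_a.mean(0)) / z_a.std(0)
+    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
+    c = z_a.T @ z_b / n  # (D, D) cross-correlation matrix
+    on_diag = (torch.diagonal(c) - 1).pow(2).sum()  # pull the diagonal towards 1
+    off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()  # push the rest towards 0
+    return on_diag + lambd * off_diag
+```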
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('barlowtwins_resnet50_8xb256-coslr-300e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :-------------------------------------------- | :--------: | :-------: | :------------------------------------------------------: | :------------------------------------------------------------------------------: | +| `barlowtwins_resnet50_8xb256-coslr-300e_in1k` | 174.54 | 4.11 | [config](barlowtwins_resnet50_8xb256-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k` | [BARLOWTWINS](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth) | 25.56 | 4.11 | 71.80 | [config](benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.json) | + +## Citation + +```bibtex +@inproceedings{zbontar2021barlow, + title={Barlow twins: Self-supervised learning via redundancy reduction}, + author={Zbontar, Jure and Jing, Li and Misra, Ishan and LeCun, Yann and Deny, St{\'e}phane}, + booktitle={International Conference on Machine Learning}, + year={2021}, +} +``` diff --git a/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-1000e_in1k.py 
b/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-1000e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f12dd2e1460094e98cbc14f8bb81f67a95cb161d --- /dev/null +++ b/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-1000e_in1k.py @@ -0,0 +1,70 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_byol.py', + '../_base_/default_runtime.py', +] +# datasets +train_dataloader = dict(batch_size=256) + +# model settings +model = dict( + type='BarlowTwins', + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='SyncBN'), + zero_init_residual=True), + neck=dict( + type='NonLinearNeck', + in_channels=2048, + hid_channels=8192, + out_channels=8192, + num_layers=3, + with_last_bn=False, + with_last_bn_affine=False, + with_avg_pool=True, + init_cfg=dict( + type='Kaiming', distribution='uniform', layer=['Linear'])), + head=dict( + type='LatentCrossCorrelationHead', + in_channels=8192, + loss=dict(type='CrossCorrelationLoss'))) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=1e-6), + paramwise_cfg=dict( + custom_keys={ + 'bn': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True), + 'bias': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True), + # bn layer in ResNet block downsample module + 'downsample.1': dict( + decay_mult=0, lr_mult=0.024, lars_exclude=True), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1.6e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=990, + eta_min=0.0016, + by_epoch=True, + begin=10, + end=1000, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1000) +default_hooks = dict(checkpoint=dict(max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py b/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..74a7f2b9bb09a3d2cb0da644935c5f2d181bd5f4 --- /dev/null +++ b/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py @@ -0,0 +1,70 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_byol.py', + '../_base_/default_runtime.py', +] +# datasets +train_dataloader = dict(batch_size=256) + +# model settings +model = dict( + type='BarlowTwins', + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='SyncBN'), + zero_init_residual=True), + neck=dict( + type='NonLinearNeck', + in_channels=2048, + hid_channels=8192, + out_channels=8192, + num_layers=3, + with_last_bn=False, + with_last_bn_affine=False, + with_avg_pool=True, + init_cfg=dict( + type='Kaiming', distribution='uniform', layer=['Linear'])), + head=dict( + type='LatentCrossCorrelationHead', + in_channels=8192, + loss=dict(type='CrossCorrelationLoss'))) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=1e-6), + paramwise_cfg=dict( + custom_keys={ + 'bn': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True), + 'bias': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True), + # bn layer in ResNet block downsample module + 'downsample.1': dict( + decay_mult=0, lr_mult=0.024, lars_exclude=True), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1.6e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=290, + eta_min=0.0016, + by_epoch=True, + begin=10, + end=300, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300) +default_hooks = dict(checkpoint=dict(max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py b/configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2f4e4f574ffd130abff07f9b1e2ec22b80fbbaba --- /dev/null +++ b/configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py @@ -0,0 +1,15 @@ +_base_ = [ + '../../_base_/models/resnet50.py', + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_sgd_coslr_100e.py', + '../../_base_/default_runtime.py', +] + +model = dict( + backbone=dict( + frozen_stages=4, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.'))) + +# runtime settings +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) diff --git a/configs/barlowtwins/metafile.yml b/configs/barlowtwins/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..705080e09af9c59ecc88737073deed6de170664c --- /dev/null +++ b/configs/barlowtwins/metafile.yml @@ -0,0 +1,44 @@ +Collections: + - Name: BarlowTwins + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - LARS + Training Resources: 8x A100 GPUs + Architecture: + - ResNet + - BarlowTwins + Paper: + Title: 'Barlow Twins: Self-Supervised Learning via Redundancy Reduction' + URL: https://arxiv.org/abs/2103.03230 + README: configs/barlowtwins/README.md + +Models: + - Name: barlowtwins_resnet50_8xb256-coslr-300e_in1k + Metadata: + Epochs: 300 + Batch Size: 2048 + FLOPs: 4109364224 + Parameters: 174535744 + Training Data: ImageNet-1k + In Collection: BarlowTwins + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth + Config: configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py + Downstream: + - resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k + - Name: resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 256 + FLOPs: 4109464576 + Parameters: 25557032 + Training Data: ImageNet-1k + In Collection: BarlowTwins + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 71.8 + Weights: https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth + Config: configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py diff --git a/configs/beit/README.md b/configs/beit/README.md new file mode 100644 index 0000000000000000000000000000000000000000..404e6524a4db0e73daffd277386131717bd4106d --- /dev/null +++ b/configs/beit/README.md @@ -0,0 +1,88 @@ +# BEiT + +> [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) + + + +## Abstract + +We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. 
The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). + +
+ +
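+In code terms, the pre-training objective above is a cross-entropy between the tokenizer's discrete visual tokens and the backbone's predictions at the masked patch positions only. The sketch below is a simplified illustration with assumed tensor shapes; in the pretraining config further down, the loss is computed by the `BEiTV1Head`.
+
+```python
+import torch.nn.functional as F
+
+
+def masked_token_prediction_loss(logits, visual_tokens, mask):
+    """Sketch of the BEiT objective.
+
+    logits:        (B, N, num_embed) predictions for every patch position
+    visual_tokens: (B, N) discrete codes from the frozen image tokenizer
+    mask:          (B, N) bool, True where a patch was masked out
+    """
+    # only the corrupted (masked) positions contribute to the loss
+    return F.cross_entropy(logits[mask], visual_tokens[mask])
+```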
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('beit-base-p16_beit-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('beit_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :---------------------------------------------- | :--------: | :-------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------: | +| `beit_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 86.53 | 17.58 | [config](beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: | +| `beit-base-p16_beit-pre_8xb128-coslr-100e_in1k` | [BEIT](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth) | 86.53 | 17.58 | 83.10 | N/A | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.json) | +| `beit-base-p16_beit-in21k-pre_3rdparty_in1k`\* | BEIT ImageNet-21k | 86.53 | 17.58 | 85.28 | 97.59 | [config](benchmarks/beit-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth) | + +*Models with * are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@inproceedings{bao2022beit, + title={{BE}iT: {BERT} Pre-Training of Image Transformers}, + author={Hangbo Bao and Li Dong and Songhao Piao and Furu Wei}, + booktitle={International Conference on Learning Representations}, + year={2022}, +} +``` diff --git a/configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..5786f79ef207f1e54b9ded1903c6b3a7b632b4f3 --- /dev/null +++ b/configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py @@ -0,0 +1,130 @@ +_base_ = '../_base_/default_runtime.py' + +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='TwoNormDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + second_mean=[-31.875, -31.875, -31.875], + second_std=[318.75, 318.75, 318.75], + to_rgb=True) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.4, + hue=0.), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandomResizedCropAndInterpolationWithTwoPic', + size=224, + second_size=112, + interpolation='bicubic', + second_interpolation='lanczos', + scale=(0.08, 1.0)), + dict( + type='BEiTMaskGenerator', + input_size=(14, 14), + num_masking_patches=75, + max_num_patches=None, + min_num_patches=16), + dict(type='PackInputs') +] +train_dataloader = dict( + batch_size=256, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='meta/train.txt', + data_prefix=dict(img_path='train/'), + pipeline=train_pipeline)) + +# model settings +model = dict( + type='BEiT', + backbone=dict( + type='BEiTPretrainViT', + arch='base', + patch_size=16, + drop_path_rate=0.1, + final_norm=True, + out_type='raw', + layer_scale_init_value=0.1, + init_cfg=[ + dict(type='TruncNormal', std=0.02, layer='Linear'), + dict(type='TruncNormal', std=0.02, layer='Conv2d'), + dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0) + ]), + neck=None, + head=dict( + type='BEiTV1Head', + embed_dims=768, + num_embed=8192, + loss=dict(type='CrossEntropyLoss')), + target_generator=dict( + type='DALL-E', + init_cfg=dict( + type='Pretrained', + checkpoint= # noqa: E251 + 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/dalle_encoder.pth', # noqa: E501 + ))) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05), + clip_grad=dict(max_norm=3.0), + paramwise_cfg=dict( + custom_keys={ + # the following configurations are designed for BEiT + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + 'q_bias': dict(decay_mult=0.0), + 'v_bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0), + '.gamma': dict(decay_mult=0.0), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + eta_min=1e-5, + by_epoch=True, + begin=10, + end=300, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', 
max_epochs=300) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py b/configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..dbab34f6e084f5c9959cfb233174a0dc059e0930 --- /dev/null +++ b/configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py @@ -0,0 +1,127 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +data_preprocessor = dict( + num_classes=1000, + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + to_rgb=True, +) + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='BEiTViT', + arch='base', + img_size=224, + patch_size=16, + drop_path_rate=0.1, + out_type='avg_featmap', + use_abs_pos_emb=False, + use_rel_pos_bias=True, + use_shared_rel_pos_bias=False, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=0.3333333333333333, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs') +] +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# optimizer wrapper +optim_wrapper = dict( + optimizer=dict( + type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + _delete_=True, + layer_decay_rate=0.65, + custom_keys={ + # the following configurations are designed for BEiT + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + 'q_bias': dict(decay_mult=0.0), + 'v_bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0), + '.gamma': dict(decay_mult=0.0), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=20, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + 
by_epoch=True, + begin=20, + end=100, + eta_min=1e-6, + convert_to_iter_based=True) +] + +# runtime settings +default_hooks = dict( + # save checkpoint per epoch. + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2)) + +train_cfg = dict(by_epoch=True, max_epochs=100) + +randomness = dict(seed=0) diff --git a/configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py b/configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..8380b69afc061d1934fae3eba57b7f352a508b1e --- /dev/null +++ b/configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py @@ -0,0 +1,43 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +model = dict( + type='ImageClassifier', + backbone=dict( + type='BEiTViT', + arch='base', + img_size=224, + patch_size=16, + out_type='avg_featmap', + use_abs_pos_emb=False, + use_rel_pos_bias=True, + use_shared_rel_pos_bias=False, + ), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/beit/metafile.yml b/configs/beit/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..e4524faec783292836fcb2520e9cff5c2262e93d --- /dev/null +++ b/configs/beit/metafile.yml @@ -0,0 +1,69 @@ +Collections: + - Name: BEiT + Metadata: + Architecture: + - Attention Dropout + - Convolution + - Dense Connections + - Dropout + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + - Tanh Activation + Paper: + Title: 'BEiT: BERT Pre-Training of Image Transformers' + URL: https://arxiv.org/abs/2106.08254 + README: configs/beit/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/beit.py + Version: v1.0.0rc4 + +Models: + - Name: beit_beit-base-p16_8xb256-amp-coslr-300e_in1k + Metadata: + Epochs: 300 + Batch Size: 2048 + FLOPs: 17581219584 + Parameters: 86530984 + Training Data: ImageNet-1k + In Collection: BEiT + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth + Config: configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py + Downstream: + - beit-base-p16_beit-pre_8xb128-coslr-100e_in1k + - Name: beit-base-p16_beit-pre_8xb128-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 1024 + FLOPs: 17581219584 + Parameters: 86530984 + Training Data: ImageNet-1k + In Collection: BEiT + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.1 + Weights: https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth + Config: configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py + - Name: beit-base-p16_beit-in21k-pre_3rdparty_in1k + Metadata: 
+ FLOPs: 17581219584 + Parameters: 86530984 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: BEiT + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 85.28 + Top 5 Accuracy: 97.59 + Weights: https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth + Config: configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py + Converted From: + Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beit/beit_base_patch16_224_pt22k_ft22kto1k.pth + Code: https://github.com/microsoft/unilm/tree/master/beit diff --git a/configs/beitv2/README.md b/configs/beitv2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5447e2d3a36e1d1e0f3d6800c4cc2e2380fdc012 --- /dev/null +++ b/configs/beitv2/README.md @@ -0,0 +1,90 @@ +# BEiTv2 + +> [BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers](https://arxiv.org/abs/2208.06366) + + + +## Abstract + +Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this work, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Furthermore, we introduce a patch aggregation strategy which associates discrete image patches to enhance global semantic representation. Experiments on image classification and semantic segmentation show that BEiT v2 outperforms all compared MIM methods. On ImageNet-1K (224 size), the base-size BEiT v2 achieves 85.5% top-1 accuracy for fine-tuning and 80.1% top-1 accuracy for linear probing. The large-size BEiT v2 obtains 87.3% top-1 accuracy for ImageNet-1K (224 size) fine-tuning, and 56.7% mIoU on ADE20K for semantic segmentation. + +
+ +
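+The semantic-rich tokenizer described above assigns each patch embedding the index of its nearest codebook entry, and those indices become the prediction targets for the masked patches. The sketch below illustrates the lookup on L2-normalized vectors; the codebook dimension is an assumption for illustration, `num_embed=8192` matches the heads in the configs below, and the pretrained tokenizer itself is loaded through the `VQKD` target generator rather than built by hand.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def quantize(patch_embeddings, codebook):
+    """Sketch of nearest-codebook lookup: (B, N, D) embeddings -> (B, N) visual token ids."""
+    z = F.normalize(patch_embeddings, dim=-1)  # (B, N, D)
+    e = F.normalize(codebook, dim=-1)          # (K, D)
+    # cosine similarity against every code; the best match is the visual token
+    return (z @ e.T).argmax(dim=-1)
+
+
+codebook = torch.randn(8192, 32)                      # e.g. num_embed=8192 as in the heads below
+tokens = quantize(torch.randn(2, 196, 32), codebook)  # (2, 196) discrete targets
+```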
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :------------------------------------------------ | :--------: | :-------: | :----------------------------------------------------------: | :----------------------------------------------------------------------: | +| `beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 192.81 | 17.58 | [config](beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: | +| `beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k` | [BEITV2](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth) | 86.53 | 17.58 | 85.00 | N/A | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.json) | +| `beit-base-p16_beitv2-in21k-pre_3rdparty_in1k`\* | BEITV2 ImageNet-21k | 86.53 | 17.58 | 86.47 | 97.99 | [config](benchmarks/beit-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth) | + +*Models with * are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit2). 
The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{beitv2, + title={{BEiT v2}: Masked Image Modeling with Vector-Quantized Visual Tokenizers}, + author={Zhiliang Peng and Li Dong and Hangbo Bao and Qixiang Ye and Furu Wei}, + year={2022}, + eprint={2208.06366}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-1600e_in1k.py b/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-1600e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..c4a2070b5de3ebbe93ed0b0658ee9157a6b62136 --- /dev/null +++ b/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-1600e_in1k.py @@ -0,0 +1,119 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs256_beitv2.py', + '../_base_/default_runtime.py', +] + +# model settings +vqkd_encoder = dict( + arch='base', + img_size=224, + patch_size=16, + in_channels=3, + out_indices=-1, + drop_rate=0., + drop_path_rate=0., + norm_cfg=dict(type='LN', eps=1e-6), + final_norm=True, + out_type='featmap', + with_cls_token=True, + frozen_stages=-1, + use_abs_pos_emb=True, + use_rel_pos_bias=False, + use_shared_rel_pos_bias=False, + layer_scale_init_value=0., + interpolate_mode='bicubic', + patch_cfg=dict(), + layer_cfgs=dict(), + init_cfg=None) + +layer_scale_init_value = 0.1 +drop_path_rate = 0.1 # 0. for 300 epochs and 0.1 for 1600 epochs. +model = dict( + type='BEiT', + backbone=dict( + type='BEiTPretrainViT', + arch='base', + patch_size=16, + out_indices=[-4, -1], + drop_path_rate=drop_path_rate, + final_norm=False, + out_type='raw', + layer_scale_init_value=layer_scale_init_value, + init_cfg=[ + dict(type='TruncNormal', std=0.02, layer='Linear'), + dict(type='TruncNormal', std=0.02, layer='Conv2d'), + dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0) + ]), + neck=dict( + type='BEiTV2Neck', + num_layers=2, + early_layers=9, + backbone_arch='base', + drop_path_rate=drop_path_rate, + layer_scale_init_value=layer_scale_init_value, + ), + head=dict( + type='BEiTV2Head', + embed_dims=768, + num_embed=8192, + loss=dict(type='CrossEntropyLoss')), + target_generator=dict( + type='VQKD', + encoder_config=vqkd_encoder, + init_cfg=dict( + type='Pretrained', + checkpoint= # noqa + 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/vqkd_encoder.pth' # noqa + ))) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + # betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs. 
+ optimizer=dict( + type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05), + clip_grad=dict(max_norm=3.0), + paramwise_cfg=dict( + custom_keys={ + # the following configurations are designed for BEiT + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + 'q_bias': dict(decay_mult=0.0), + 'v_bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0), + '.gamma': dict(decay_mult=0.0), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + eta_min=1e-5, + by_epoch=True, + begin=10, + end=1600, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..fddeccff1998fa850097ca4ae07b6fe874476dd0 --- /dev/null +++ b/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py @@ -0,0 +1,119 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs256_beitv2.py', + '../_base_/default_runtime.py', +] + +# model settings +vqkd_encoder = dict( + arch='base', + img_size=224, + patch_size=16, + in_channels=3, + out_indices=-1, + drop_rate=0., + drop_path_rate=0., + norm_cfg=dict(type='LN', eps=1e-6), + final_norm=True, + out_type='featmap', + with_cls_token=True, + frozen_stages=-1, + use_abs_pos_emb=True, + use_rel_pos_bias=False, + use_shared_rel_pos_bias=False, + layer_scale_init_value=0., + interpolate_mode='bicubic', + patch_cfg=dict(), + layer_cfgs=dict(), + init_cfg=None) + +layer_scale_init_value = 0.1 +drop_path_rate = 0. # 0. for 300 epochs and 0.1 for 1600 epochs. +model = dict( + type='BEiT', + backbone=dict( + type='BEiTPretrainViT', + arch='base', + patch_size=16, + out_indices=[-4, -1], + drop_path_rate=drop_path_rate, + final_norm=False, + out_type='raw', + layer_scale_init_value=layer_scale_init_value, + init_cfg=[ + dict(type='TruncNormal', std=0.02, layer='Linear'), + dict(type='TruncNormal', std=0.02, layer='Conv2d'), + dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0) + ]), + neck=dict( + type='BEiTV2Neck', + num_layers=2, + early_layers=9, + backbone_arch='base', + drop_path_rate=drop_path_rate, + layer_scale_init_value=layer_scale_init_value, + ), + head=dict( + type='BEiTV2Head', + embed_dims=768, + num_embed=8192, + loss=dict(type='CrossEntropyLoss')), + target_generator=dict( + type='VQKD', + encoder_config=vqkd_encoder, + init_cfg=dict( + type='Pretrained', + checkpoint= # noqa + 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/vqkd_encoder.pth' # noqa + ))) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + # betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs. 
+ optimizer=dict( + type='AdamW', lr=1.5e-3, betas=(0.9, 0.98), weight_decay=0.05), + clip_grad=dict(max_norm=3.0), + paramwise_cfg=dict( + custom_keys={ + # the following configurations are designed for BEiT + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + 'q_bias': dict(decay_mult=0.0), + 'v_bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0), + '.gamma': dict(decay_mult=0.0), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + eta_min=1e-5, + by_epoch=True, + begin=10, + end=300, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py b/configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a2c55a706b351d5c8bd7981aaa324877cb440b11 --- /dev/null +++ b/configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py @@ -0,0 +1,122 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='BEiTViT', + arch='base', + img_size=224, + patch_size=16, + # 0.2 for 1600 epochs pretrained models and 0.1 for 300 epochs. 
+ drop_path_rate=0.1, + out_type='avg_featmap', + use_abs_pos_emb=False, + use_rel_pos_bias=True, + use_shared_rel_pos_bias=False, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=0.3333333333333333, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs') +] +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# optimizer wrapper +optim_wrapper = dict( + optimizer=dict( + type='AdamW', lr=5e-4, weight_decay=0.05, betas=(0.9, 0.999)), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + _delete_=True, + # 0.6 for 1600 epochs pretrained models and 0.65 for 300 epochs + layer_decay_rate=0.65, + custom_keys={ + # the following configurations are designed for BEiT + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + 'q_bias': dict(decay_mult=0.0), + 'v_bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0), + '.gamma': dict(decay_mult=0.0), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=20, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + by_epoch=True, + begin=20, + end=100, + eta_min=1e-6, + convert_to_iter_based=True) +] + +# runtime settings +default_hooks = dict( + # save checkpoint per epoch. 
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2)) + +train_cfg = dict(by_epoch=True, max_epochs=100) + +randomness = dict(seed=0) diff --git a/configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py b/configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..17ed4ff3d2cf40f8d819add1b3aa4f668a41128a --- /dev/null +++ b/configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='BEiTViT', + arch='base', + img_size=224, + patch_size=16, + out_type='avg_featmap', + use_abs_pos_emb=False, + use_rel_pos_bias=True, + use_shared_rel_pos_bias=False, + ), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/beitv2/metafile.yml b/configs/beitv2/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..74c3885e11cd8140cea7aac40973ade4ce4e7e64 --- /dev/null +++ b/configs/beitv2/metafile.yml @@ -0,0 +1,69 @@ +Collections: + - Name: BEiTv2 + Metadata: + Architecture: + - Attention Dropout + - Convolution + - Dense Connections + - Dropout + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + - Tanh Activation + Paper: + Title: 'BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers' + URL: https://arxiv.org/abs/2208.06366 + README: configs/beitv2/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/beit.py + Version: v1.0.0rc4 + +Models: + - Name: beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k + Metadata: + Epochs: 300 + Batch Size: 2048 + FLOPs: 17581223424 + Parameters: 192811376 + Training Data: ImageNet-1k + In Collection: BEiTv2 + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth + Config: configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py + Downstream: + - beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k + - Name: beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 1024 + FLOPs: 17581219584 + Parameters: 86530984 + Training Data: ImageNet-1k + In Collection: BEiTv2 + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.0 + Weights: https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth + Config: configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py + - Name: beit-base-p16_beitv2-in21k-pre_3rdparty_in1k + Metadata: + FLOPs: 17581219584 + Parameters: 86530984 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: BEiTv2 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 86.47 + Top 5 Accuracy: 97.99 + Weights: 
https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth + Config: configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py + Converted From: + Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beitv2/beitv2_base_patch16_224_pt1k_ft21kto1k.pth + Code: https://github.com/microsoft/unilm/tree/master/beit2 diff --git a/configs/blip/README.md b/configs/blip/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1a8dce392cb3ec3ab36eed8ab9b3af90ee0f1219 --- /dev/null +++ b/configs/blip/README.md @@ -0,0 +1,128 @@ +# BLIP + +> [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) + + + +## Abstract + +Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. + +
+ +
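+The caption bootstrapping ("CapFilt") described above is a dataset-cleaning loop rather than a new loss: a finetuned captioner proposes synthetic captions for web images, and a finetuned image-text matching filter keeps only the pairs it judges as matched. The sketch below is purely schematic; `captioner`, `filter_model` and `threshold` are hypothetical placeholders, not APIs of this repository.
+
+```python
+def capfilt(web_pairs, captioner, filter_model, threshold=0.5):
+    """Schematic CapFilt loop: bootstrap a cleaner image-text dataset from noisy web pairs."""
+    cleaned = []
+    for image, web_text in web_pairs:
+        synthetic = captioner(image)                    # captioner proposes a synthetic caption
+        for text in (web_text, synthetic):
+            if filter_model(image, text) >= threshold:  # matching filter keeps matched pairs only
+                cleaned.append((image, text))
+    return cleaned
+```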
+ +## How to use it? + + + +**Use the model** + +```python +from mmpretrain import inference_model + +result = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png') +print(result) +# {'pred_caption': 'a puppy and a cat sitting on a blanket'} +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/blip/blip-base_8xb32_caption.py https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth +``` + + + +## Models and results + +### Image Caption on COCO + +| Model | Params (M) | BLEU-4 | CIDER | Config | Download | +| :----------------------------- | :--------: | :----: | :----: | :------------------------------------: | :------------------------------------------------------------------------------------------------------------: | +| `blip-base_3rdparty_caption`\* | 223.97 | 40.12 | 132.82 | [config](./blip-base_8xb32_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth) | + +### Image Caption on NoCaps + +| Model | Params (M) | SPICE | CIDER | Config | Download | +| :----------------------------- | :--------: | :---: | :----: | :-----------------------------------: | :--------------------------------------------------------------------------------------------------------------: | +| `blip-base_3rdparty_caption`\* | 223.97 | 14.69 | 109.12 | [config](./blip-base_8xb32_nocaps.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth) | + +### Image Caption on Flickr30k + +| Model | Params (M) | SPICE | CIDER | Config | Download | +| :----------------------------- | :--------: | :---: | :---: | :----------------------------------------------: | :----------------------------------------------------------------------------------------------------: | +| `blip-base_3rdparty_caption`\* | 223.97 | 15.58 | 68.89 | [config](./blip-base_8xb32_caption_flickr30k.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth) | + +### Visual Grounding on RefCOCO + +| Model | Params (M) | Accuracy (testA) | Accuracy (testB) | Config | Download | +| :------------------------ | :--------: | :--------------: | :--------------: | :----------------------------------: | :-----------------------------------------------------------------------------------------------: | +| `blip-base_8xb16_refcoco` | 498.49 | 86.14 | 77.33 | [config](blip-base_8xb16_refcoco.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_8xb16_refcoco_20230508-d2d10f4c.pth) \| [log](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_8xb16_refcoco_20230508-d2d10f4c.json) | + +### Visual Question Answering on VQAv2 + +| Model | Params (M) | Accuracy | Config | Download | +| :------------------------- | :--------: | :------: | :--------------------------------: | :-------------------------------------------------------------------------------------------------------------------: | +| `blip-base_3rdparty_vqa`\* | 361.48 | 78.20 | [config](./blip-base_8xb32_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth) | + +### Visual Question Answering on OK-VQA + +| Model | Params (M) | Accuracy | Config | Download | +| 
:------------------------- | :--------: | :------: | :----------------------------------: | :-------------------------------------------------------------------------------------------------------------------: | +| `blip-base_3rdparty_vqa`\* | 361.48 | 40.59# | [config](./blip-base_8xb32_okvqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth) | + +### Visual Question Answering on OCR-VQA + +| Model | Params (M) | Accuracy | Config | Download | +| :------------------------- | :--------: | :------: | :-----------------------------------: | :-------------------------------------------------------------------------------------------------------------------: | +| `blip-base_3rdparty_vqa`\* | 361.48 | 28.30# | [config](./blip-base_8xb32_ocrvqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth) | + +### Image-To-Text Retrieval on COCO + +| Model | Params (M) | Recall@1 | Recall@5 | Config | Download | +| :------------------------------- | :--------: | :------: | :------: | :--------------------------------------: | :----------------------------------------------------------------------------------------------------: | +| `blip-base_3rdparty_retrieval`\* | 447.49 | 82.52 | 95.34 | [config](./blip-base_8xb32_retrieval.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) | + +### Text-To-Image Retrieval on COCO + +| Model | Params (M) | Recall@1 | Recall@5 | Config | Download | +| :------------------------------- | :--------: | :------: | :------: | :--------------------------------------: | :----------------------------------------------------------------------------------------------------: | +| `blip-base_3rdparty_retrieval`\* | 447.49 | 64.82 | 86.28 | [config](./blip-base_8xb32_retrieval.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) | + +### Image-To-Text Retrieval on Flickr30k + +| Model | Params (M) | Recall@1 | Recall@5 | Config | Download | +| :------------------------------- | :--------: | :------: | :------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: | +| `blip-base_3rdparty_retrieval`\* | 447.49 | 95.10# | 99.60# | [config](./blip-base_8xb32_retrieval_flickr30k.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) | + +### Text-To-Image Retrieval on Flickr30k + +| Model | Params (M) | Recall@1 | Recall@5 | Config | Download | +| :------------------------------- | :--------: | :------: | :------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: | +| `blip-base_3rdparty_retrieval`\* | 447.49 | 85.26# | 96.58# | [config](./blip-base_8xb32_retrieval_flickr30k.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) | + +### NLVR on NLVR2 + +| Model | Params (M) | Top-1 (%) | Config | Download | +| :-------------------------- | :--------: | :-------: | :---------------------------------: | :------------------------------------------------------------------------------------------------------------: | +| `blip-base_3rdparty_nlvr`\* | 259.37 | 82.33 | 
[config](./blip-base_8xb32_nlvr.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_nlvr_20230427-3b14d33f.pth) | + +*Models with * are converted from the [official repo](https://github.com/salesforce/LAVIS). The config files of these models are only for inference. We haven't reproduced the training results.* + +*Results with # denote zero-shot evaluation. The corresponding model hasn't been fine-tuned on that dataset.* + +## Citation + +```bibtex +@inproceedings{li2022blip, + title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, + author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi}, + year={2022}, + booktitle={ICML}, +} +``` diff --git a/configs/blip/blip-base_8xb16_refcoco.py b/configs/blip/blip-base_8xb16_refcoco.py new file mode 100644 index 0000000000000000000000000000000000000000..b4986143a3d6965f7176bbcea445f675cc9a80ec --- /dev/null +++ b/configs/blip/blip-base_8xb16_refcoco.py @@ -0,0 +1,62 @@ +_base_ = [ + '../_base_/datasets/refcoco.py', + '../_base_/default_runtime.py', +] + +med_config = { + 'architectures': ['BertModel'], + 'attention_probs_dropout_prob': 0.1, + 'hidden_act': 'gelu', + 'hidden_dropout_prob': 0.1, + 'hidden_size': 768, + 'initializer_range': 0.02, + 'intermediate_size': 3072, + 'layer_norm_eps': 1e-12, + 'max_position_embeddings': 512, + 'model_type': 'bert', + 'num_attention_heads': 12, + 'num_hidden_layers': 12, + 'pad_token_id': 0, + 'add_type_embeddings': False, + 'vocab_size': 30524, + 'encoder_width': 768, + 'add_cross_attention': True +} + +model = dict( + type='BlipGrounding', + visual_encoder=dict( + type='VisionTransformer', + arch='b', + img_size=384, + patch_size=16, + out_type='raw', + ), + text_encoder=dict( + type='XBertEncoder', + med_config=med_config, + ), + multimodal_encoder=dict( + type='XBertEncoder', + med_config=med_config, + ), + tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'), + head=dict( + type='GroundingHead', + decoder=dict( + type='XBertLMHeadDecoder', + med_config=med_config, + ), + box_l1_loss_coeff=4.0, + box_giou_loss_coeff=2.0, + ), +) + +# schedule settings +optimizer = dict(type='AdamW', lr=1.5e-5, weight_decay=0.02) +optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer) +param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)] + +train_cfg = dict(by_epoch=True, max_epochs=120) +val_cfg = dict() +test_cfg = dict() diff --git a/configs/blip/blip-base_8xb32_caption.py b/configs/blip/blip-base_8xb32_caption.py new file mode 100644 index 0000000000000000000000000000000000000000..1e24e9eababa53b17ac38502ea37eb6a9de40cf5 --- /dev/null +++ b/configs/blip/blip-base_8xb32_caption.py @@ -0,0 +1,59 @@ +_base_ = [ + '../_base_/datasets/coco_caption.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='BlipCaption', + vision_encoder=dict( + type='VisionTransformer', + arch='b', + img_size=384, + patch_size=16, + out_type='raw', + ), + tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'), + decoder_head=dict( + type='SeqGenerationHead', + decoder=dict( + type='XBertLMHeadDecoder', + med_config=dict( + architectures=['BertModel'], + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + layer_norm_eps=1e-12, + max_position_embeddings=512, + model_type='bert', + num_attention_heads=12, + num_hidden_layers=12, + pad_token_id=0, +
add_type_embeddings=False, + vocab_size=30524, + encoder_width=768, + add_cross_attention=True), + ), + ), + prompt='a picture of ', + max_txt_len=20, +) + +# schedule settings +optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05)) + +param_scheduler = [ + dict( + type='CosineAnnealingLR', + by_epoch=True, + begin=0, + end=10, + ) +] + +train_cfg = dict(max_epochs=10) +val_cfg = dict() +test_cfg = dict() diff --git a/configs/blip/blip-base_8xb32_caption_flickr30k.py b/configs/blip/blip-base_8xb32_caption_flickr30k.py new file mode 100644 index 0000000000000000000000000000000000000000..9fe6ec561d6b7cd09d2490e8fb50f4f8315a14ba --- /dev/null +++ b/configs/blip/blip-base_8xb32_caption_flickr30k.py @@ -0,0 +1,59 @@ +_base_ = [ + '../_base_/datasets/flickr30k_caption.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='BlipCaption', + vision_encoder=dict( + type='VisionTransformer', + arch='b', + img_size=384, + patch_size=16, + out_type='raw', + ), + tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'), + decoder_head=dict( + type='SeqGenerationHead', + decoder=dict( + type='XBertLMHeadDecoder', + med_config=dict( + architectures=['BertModel'], + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + layer_norm_eps=1e-12, + max_position_embeddings=512, + model_type='bert', + num_attention_heads=12, + num_hidden_layers=12, + pad_token_id=0, + add_type_embeddings=False, + vocab_size=30524, + encoder_width=768, + add_cross_attention=True), + ), + ), + prompt='a picture of ', + max_txt_len=20, +) + +# schedule settings +optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05)) + +param_scheduler = [ + dict( + type='CosineAnnealingLR', + by_epoch=True, + begin=0, + end=10, + ) +] + +train_cfg = dict(max_epochs=10) +val_cfg = dict() +test_cfg = dict() diff --git a/configs/blip/blip-base_8xb32_nlvr.py b/configs/blip/blip-base_8xb32_nlvr.py new file mode 100644 index 0000000000000000000000000000000000000000..0a6cfe149a07b508830069ba8b8ec4e3ccccc7c0 --- /dev/null +++ b/configs/blip/blip-base_8xb32_nlvr.py @@ -0,0 +1,59 @@ +_base_ = [ + '../_base_/datasets/nlvr2.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='BlipNLVR', + vision_backbone=dict( + type='VisionTransformer', + arch='b', + img_size=384, + patch_size=16, + out_type='raw', + ), + tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'), + multimodal_backbone=dict( + type='BertModel', + config=dict( + architectures=['BertModel'], + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + layer_norm_eps=1e-12, + max_position_embeddings=512, + model_type='bert', + num_attention_heads=12, + num_hidden_layers=12, + pad_token_id=0, + add_type_embeddings=False, + vocab_size=30524, + encoder_width=768, + add_cross_attention=True, + nlvr=True), + add_pooling_layer=False), +) + +# optimizer +optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.05) +optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer) + +param_scheduler = [ + dict( + type='CosineAnnealingLR', + by_epoch=True, + begin=0, + end=10, + ) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=10) +val_cfg = dict() +test_cfg = dict() + +default_hooks = dict(logger=dict(interval=1)) diff --git 
a/configs/blip/blip-base_8xb32_nocaps.py b/configs/blip/blip-base_8xb32_nocaps.py new file mode 100644 index 0000000000000000000000000000000000000000..c47c56aeec9f6b9f36b35d4ea8c078c06df586ab --- /dev/null +++ b/configs/blip/blip-base_8xb32_nocaps.py @@ -0,0 +1,46 @@ +_base_ = [ + '../_base_/datasets/nocaps.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='BlipCaption', + vision_encoder=dict( + type='VisionTransformer', + arch='b', + img_size=384, + patch_size=16, + out_type='raw', + ), + tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'), + decoder_head=dict( + type='SeqGenerationHead', + decoder=dict( + type='XBertLMHeadDecoder', + med_config=dict( + architectures=['BertModel'], + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + layer_norm_eps=1e-12, + max_position_embeddings=512, + model_type='bert', + num_attention_heads=12, + num_hidden_layers=12, + pad_token_id=0, + add_type_embeddings=False, + vocab_size=30524, + encoder_width=768, + add_cross_attention=True), + ), + ), + prompt='a picture of ', + max_txt_len=20, +) + +val_cfg = dict() +test_cfg = dict() diff --git a/configs/blip/blip-base_8xb32_ocrvqa.py b/configs/blip/blip-base_8xb32_ocrvqa.py new file mode 100644 index 0000000000000000000000000000000000000000..117d597fcb2d92aab1c0f0bc79aa895a3ab99643 --- /dev/null +++ b/configs/blip/blip-base_8xb32_ocrvqa.py @@ -0,0 +1,75 @@ +_base_ = [ + '../_base_/datasets/ocrvqa.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='BlipVQA', + tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'), + vision_backbone=dict( + type='VisionTransformer', + arch='b', + img_size=480, + patch_size=16, + out_type='raw'), + multimodal_backbone=dict( + type='XBertEncoder', + med_config=dict( + architectures=['BertModel'], + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + layer_norm_eps=1e-12, + max_position_embeddings=512, + model_type='bert', + num_attention_heads=12, + num_hidden_layers=12, + pad_token_id=0, + add_type_embeddings=False, + vocab_size=30524, + encoder_width=768, + add_cross_attention=True), + ), + head=dict( + type='VQAGenerationHead', + decoder=dict( + type='XBertLMHeadDecoder', + med_config=dict( + architectures=['BertModel'], + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + layer_norm_eps=1e-12, + max_position_embeddings=512, + model_type='bert', + num_attention_heads=12, + num_hidden_layers=12, + pad_token_id=0, + add_type_embeddings=False, + vocab_size=30524, + encoder_width=768, + add_cross_attention=True), + ), + inference_method='generate', + ), +) + +# schedule settings +optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.05) +optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer) + +param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)] + +train_cfg = dict(max_epochs=10, by_epoch=True) +val_cfg = dict() +test_cfg = dict() + +# runtime settings +randomness = dict(seed=42) diff --git a/configs/blip/blip-base_8xb32_okvqa.py b/configs/blip/blip-base_8xb32_okvqa.py new file mode 100644 index 0000000000000000000000000000000000000000..548775c4e0f91128f41701042346b5d4a2567950 --- /dev/null +++ b/configs/blip/blip-base_8xb32_okvqa.py @@ -0,0 +1,75 @@ 
+_base_ = [ + '../_base_/datasets/coco_okvqa.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='BlipVQA', + tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'), + vision_backbone=dict( + type='VisionTransformer', + arch='b', + img_size=480, + patch_size=16, + out_type='raw'), + multimodal_backbone=dict( + type='XBertEncoder', + med_config=dict( + architectures=['BertModel'], + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + layer_norm_eps=1e-12, + max_position_embeddings=512, + model_type='bert', + num_attention_heads=12, + num_hidden_layers=12, + pad_token_id=0, + add_type_embeddings=False, + vocab_size=30524, + encoder_width=768, + add_cross_attention=True), + ), + head=dict( + type='VQAGenerationHead', + decoder=dict( + type='XBertLMHeadDecoder', + med_config=dict( + architectures=['BertModel'], + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + layer_norm_eps=1e-12, + max_position_embeddings=512, + model_type='bert', + num_attention_heads=12, + num_hidden_layers=12, + pad_token_id=0, + add_type_embeddings=False, + vocab_size=30524, + encoder_width=768, + add_cross_attention=True), + ), + inference_method='generate', + ), +) + +# schedule settings +optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.05) +optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer) + +param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)] + +train_cfg = dict(max_epochs=10, by_epoch=True) +val_cfg = dict() +test_cfg = dict() + +# runtime settings +randomness = dict(seed=42) diff --git a/configs/blip/blip-base_8xb32_retrieval.py b/configs/blip/blip-base_8xb32_retrieval.py new file mode 100644 index 0000000000000000000000000000000000000000..645f88fd2a8e7ca06c75f603b7ad55539ef60053 --- /dev/null +++ b/configs/blip/blip-base_8xb32_retrieval.py @@ -0,0 +1,83 @@ +_base_ = [ + '../_base_/datasets/coco_retrieval.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='BlipRetrieval', + tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'), + vision_backbone=dict( + type='VisionTransformer', + arch='b', + img_size=384, + patch_size=16, + out_type='raw', + ), + text_backbone=dict( + type='XBertEncoder', + med_config=dict( + architectures=['BertModel'], + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + layer_norm_eps=1e-12, + max_position_embeddings=512, + model_type='bert', + num_attention_heads=12, + num_hidden_layers=12, + pad_token_id=0, + add_type_embeddings=False, + vocab_size=30524, + encoder_width=768, + add_cross_attention=True), + ), + vision_neck=dict( + type='Linear', + in_features=768, + out_features=256, + ), + text_neck=dict( + type='Linear', + in_features=768, + out_features=256, + ), + head=dict( + type='ITCHead', + embed_dim=256, + ), + multimodal_head=dict( + type='ITMHead', + hidden_size=768, + with_pooler=False, + ), + topk=256, + max_txt_len=35, +) + +# optimizer +optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.04) +optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer) + +# learning rate scheduler +param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=6) +val_cfg 
= dict(type='RetrievalValLoop') +test_cfg = dict(type='RetrievalTestLoop') + +randomness = dict(seed=42) + +default_hooks = dict(logger=dict(interval=1)) + +custom_hooks = [ + dict( + type='WarmupParamHook', + param_name='alpha', + module_name='head', + warmup_epochs=2) +] diff --git a/configs/blip/blip-base_8xb32_retrieval_flickr30k.py b/configs/blip/blip-base_8xb32_retrieval_flickr30k.py new file mode 100644 index 0000000000000000000000000000000000000000..0d2e78e943161ec57539096aff5cbc7ae5f29186 --- /dev/null +++ b/configs/blip/blip-base_8xb32_retrieval_flickr30k.py @@ -0,0 +1,83 @@ +_base_ = [ + '../_base_/datasets/flickr30k_retrieval.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='BlipRetrieval', + tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'), + vision_backbone=dict( + type='VisionTransformer', + arch='b', + img_size=384, + patch_size=16, + out_type='raw', + ), + text_backbone=dict( + type='XBertEncoder', + med_config=dict( + architectures=['BertModel'], + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + layer_norm_eps=1e-12, + max_position_embeddings=512, + model_type='bert', + num_attention_heads=12, + num_hidden_layers=12, + pad_token_id=0, + add_type_embeddings=False, + vocab_size=30524, + encoder_width=768, + add_cross_attention=True), + ), + vision_neck=dict( + type='Linear', + in_features=768, + out_features=256, + ), + text_neck=dict( + type='Linear', + in_features=768, + out_features=256, + ), + head=dict( + type='ITCHead', + embed_dim=256, + ), + multimodal_head=dict( + type='ITMHead', + hidden_size=768, + with_pooler=False, + ), + topk=256, + max_txt_len=35, +) + +# optimizer +optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.04) +optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer) + +# learning rate scheduler +param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=6) +val_cfg = dict(type='RetrievalValLoop') +test_cfg = dict(type='RetrievalTestLoop') + +randomness = dict(seed=42) + +default_hooks = dict(logger=dict(interval=1)) + +custom_hooks = [ + dict( + type='WarmupParamHook', + param_name='alpha', + module_name='head', + warmup_epochs=2) +] diff --git a/configs/blip/blip-base_8xb32_vqa.py b/configs/blip/blip-base_8xb32_vqa.py new file mode 100644 index 0000000000000000000000000000000000000000..2aa3f258579617d31b52b6e5a8e7703c56966dd4 --- /dev/null +++ b/configs/blip/blip-base_8xb32_vqa.py @@ -0,0 +1,76 @@ +_base_ = [ + '../_base_/datasets/coco_vg_vqa.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='BlipVQA', + tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'), + vision_backbone=dict( + type='VisionTransformer', + arch='b', + img_size=480, + patch_size=16, + out_type='raw'), + multimodal_backbone=dict( + type='XBertEncoder', + med_config=dict( + architectures=['BertModel'], + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + layer_norm_eps=1e-12, + max_position_embeddings=512, + model_type='bert', + num_attention_heads=12, + num_hidden_layers=12, + pad_token_id=0, + add_type_embeddings=False, + vocab_size=30524, + encoder_width=768, + add_cross_attention=True), + ), + head=dict( + type='VQAGenerationHead', + decoder=dict( + 
type='XBertLMHeadDecoder', + med_config=dict( + architectures=['BertModel'], + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + layer_norm_eps=1e-12, + max_position_embeddings=512, + model_type='bert', + num_attention_heads=12, + num_hidden_layers=12, + pad_token_id=0, + add_type_embeddings=False, + vocab_size=30524, + encoder_width=768, + add_cross_attention=True), + ), + inference_method='rank', # or 'generate' + answer_list_path= + 'https://storage.googleapis.com/sfr-vision-language-research/datasets/answer_list.json', # noqa: E501 + ), +) + +# schedule settings +optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.05) +optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer) + +param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)] + +train_cfg = dict(max_epochs=10, by_epoch=True) +test_cfg = dict() + +# runtime settings +randomness = dict(seed=42) diff --git a/configs/blip/metafile.yml b/configs/blip/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..8877e8192110df35415c875c834fc914bd3a038c --- /dev/null +++ b/configs/blip/metafile.yml @@ -0,0 +1,99 @@ +Collections: + - Name: BLIP + Metadata: + Training Data: + - COCO + - VG + - Conceptual Captions + - Conceptual 12M + - SBU captions + Architecture: + - Transformer + Training Resources: 8x A100 GPUs + Paper: + Title: 'BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language + Understanding and Generation' + URL: https://arxiv.org/abs/2201.12086 + README: configs/blip/README.md + +Models: + - Name: blip-base_8xb16_refcoco + Metadata: + FLOPs: null + Parameters: 498488636 + In Collection: BLIP + Results: + - Task: Visual Grounding + Dataset: RefCOCO + Metrics: + Accuracy (testA): 86.14 + Accuracy (testB): 77.33 + Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_8xb16_refcoco_20230508-d2d10f4c.pth + Config: configs/blip/blip-base_8xb16_refcoco.py + - Name: blip-base_3rdparty_caption + Metadata: + FLOPs: null + Parameters: 223971644 + In Collection: BLIP + Results: + - Dataset: COCO + Task: Image Caption + Metrics: + BLEU-4: 40.12 + CIDER: 132.82 + Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth + Config: configs/blip/blip-base_8xb32_caption.py + Converted From: + Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP/blip_coco_caption_base.pth + Code: https://github.com/salesforce/LAVIS + - Name: blip-base_3rdparty_nlvr + Metadata: + FLOPs: null + Parameters: 259372034 + In Collection: BLIP + Results: + - Task: NLVR + Dataset: NLVR2 + Metrics: + Top 1 Accuracy: 82.33 + Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_nlvr_20230427-3b14d33f.pth + Config: configs/blip/blip-base_8xb32_nlvr.py + Converted From: + Weights: https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_nlvr.pth + Code: https://github.com/salesforce/LAVIS + - Name: blip-base_3rdparty_vqa + Metadata: + FLOPs: null + Parameters: 361478972 + In Collection: BLIP + Results: + - Task: Visual Question Answering + Dataset: VQAv2 + Metrics: + Accuracy: 78.2 + Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth + Config: configs/blip/blip-base_8xb32_vqa.py + Converted From: + Weights: 
https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth + Code: https://github.com/salesforce/LAVIS + - Name: blip-base_3rdparty_retrieval + Metadata: + FLOPs: null + Parameters: 447486979 + In Collection: BLIP + Results: + - Task: Image-To-Text Retrieval + Dataset: COCO + Metrics: + Recall@1: 82.52 + Recall@5: 95.34 + - Task: Text-To-Image Retrieval + Dataset: COCO + Metrics: + Recall@1: 64.82 + Recall@5: 86.28 + Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth + Config: configs/blip/blip-base_8xb32_retrieval.py + Converted From: + Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP/blip_coco_retrieval.pth + Code: https://github.com/salesforce/LAVIS diff --git a/configs/blip2/README.md b/configs/blip2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..68ce679d704dfc23a0afdd7ec2528df9d144547e --- /dev/null +++ b/configs/blip2/README.md @@ -0,0 +1,74 @@ +# BLIP-2 + +> [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](http://arxiv.org/abs/2301.12597) + + + +## Abstract + +The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model’s emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions. + +
+ +
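+The abstract describes a three-part pipeline: a frozen image encoder, a lightweight Querying Transformer (Q-Former) whose learned query tokens read the image features, and a frozen large language model that receives the query outputs as a soft prompt. Below is a minimal, self-contained PyTorch sketch of that composition with toy stand-ins for every module; the names (`ToyQFormer`, the stand-in `vision_encoder`) and all sizes are illustrative assumptions, not the mmpretrain implementation (the real settings live in the `BEiTViT`, `Qformer` and `OPTForCausalLM` entries of the configs below).
+
+```python
+# Toy sketch of the BLIP-2 composition (NOT the mmpretrain implementation):
+# frozen image encoder -> learned query tokens with cross-attention (Q-Former)
+# -> projection into the embedding space of a frozen language model.
+import torch
+import torch.nn as nn
+
+
+class ToyQFormer(nn.Module):
+    """Learned queries that read frozen image features via cross-attention."""
+
+    def __init__(self, num_queries=32, dim=256, vision_dim=512, llm_dim=768):
+        super().__init__()
+        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
+        self.vision_proj = nn.Linear(vision_dim, dim)
+        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
+        self.to_llm = nn.Linear(dim, llm_dim)  # maps query outputs into the LLM input space
+
+    def forward(self, image_feats):
+        kv = self.vision_proj(image_feats)
+        q = self.queries.expand(image_feats.size(0), -1, -1)
+        out, _ = self.cross_attn(q, kv, kv)
+        return self.to_llm(out)
+
+
+# Frozen stand-in for the image encoder: a simple patch-embedding layer.
+vision_encoder = nn.Linear(3 * 16 * 16, 512)
+for p in vision_encoder.parameters():
+    p.requires_grad = False  # the image encoder is never updated
+
+qformer = ToyQFormer()
+patches = torch.randn(1, 196, 3 * 16 * 16)  # one fake image as 14x14 patches
+with torch.no_grad():
+    image_feats = vision_encoder(patches)   # (1, 196, 512) frozen features
+
+soft_prompt = qformer(image_feats)          # (1, 32, 768) query outputs
+# In BLIP-2 these 32 tokens are prepended to the text embeddings of a frozen
+# LLM (e.g. OPT-2.7B), which then generates the caption or answer.
+print(soft_prompt.shape)
+```
+
+Only the Q-Former and the projections are trained in such a setup, which is why the number of trainable parameters stays small even though the surrounding encoder and language model are very large.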
+ +## How to use it? + + + +**Use the model** + +```python +from mmpretrain import inference_model + +result = inference_model('blip2-opt2.7b_3rdparty-zeroshot_caption', 'demo/cat-dog.png') +print(result) +# {'pred_caption': 'a dog and a cat sitting on a blanket'} +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/blip2/blip2_8xb32_retrieval.py https://download.openmmlab.com/mmclassification/v1/blip2/blip2_3rdparty_pretrain_20230505-f7ef4390.pth +``` + + + +## Models and results + +### Image Caption on COCO + +| Model | Params (M) | BLEU-4 | CIDER | Config | Download | +| :------------------------------------------ | :--------: | :----: | :----: | :----------------------------------------: | :-------------------------------------------------------------------------------------------: | +| `blip2-opt2.7b_3rdparty-zeroshot_caption`\* | 3770.47 | 32.90 | 111.10 | [config](./blip2-opt2.7b_8xb32_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth) | + +### Visual Question Answering on VQAv2 + +| Model | Params (M) | Accuracy | Config | Download | +| :-------------------------------------- | :--------: | :------: | :------------------------------------: | :-------------------------------------------------------------------------------------------------------: | +| `blip2-opt2.7b_3rdparty-zeroshot_vqa`\* | 3770.47 | 53.50 | [config](./blip2-opt2.7b_8xb16_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth) | + +### Image-To-Text Retrieval on COCO + +| Model | Params (M) | Recall@1 | Config | Download | +| :--------------------------- | :--------: | :------: | :----------------------------------: | :-------------------------------------------------------------------------------------------------------------: | +| `blip2_3rdparty_retrieval`\* | 1173.19 | 85.40 | [config](./blip2_8xb32_retrieval.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip2/blip2_3rdparty_pretrain_20230505-f7ef4390.pth) | + +*Models with * are converted from the [official repo](https://github.com/salesforce/LAVIS). The config files of these models are only for inference. 
We haven't reproduced the training results.* + +## Citation + +```bibtex +@article{blip2, + title={Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models}, + author={Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven}, + year={2023}, + eprint={2301.12597}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/configs/blip2/blip2-opt2.7b_8xb16_gqa.py b/configs/blip2/blip2-opt2.7b_8xb16_gqa.py new file mode 100644 index 0000000000000000000000000000000000000000..37fbd95e8e4b49d87f4da7b8d0f4cc7650f23dcd --- /dev/null +++ b/configs/blip2/blip2-opt2.7b_8xb16_gqa.py @@ -0,0 +1,87 @@ +_base_ = [ + '../_base_/datasets/gqa.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='Blip2VQA', + tokenizer=dict( + type='AutoTokenizer', name_or_path='facebook/opt-2.7b', + use_fast=False), + vision_backbone=dict( + type='BEiTViT', + # eva-g without the final layer + arch=dict( + embed_dims=1408, + num_layers=39, + num_heads=16, + feedforward_channels=6144, + ), + img_size=364, + patch_size=14, + out_indices=-2, + layer_scale_init_value=0.0, + use_abs_pos_emb=True, + use_rel_pos_bias=False, + frozen_stages=39, + final_norm=False, + use_shared_rel_pos_bias=False, + out_type='raw'), + text_backbone=dict( + type='OPTForCausalLM', name_or_path='facebook/opt-2.7b'), + multimodal_backbone=dict( + type='Qformer', + model_style='bert-base-uncased', + vision_model_width=1408, + add_cross_attention=True, + cross_attention_freq=2, + num_query_token=32), + vision_neck=dict( + type='LinearClsHead', + in_channels=768, + num_classes=2560, + ), + prompt='Question: {} Short Answer:', + max_txt_len=10) + +# data settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224), + dict(type='PackInputs', algorithm_keys=['question', 'gt_answer']), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(224, 224), + interpolation='bicubic', + backend='pillow'), + dict( + type='CleanCaption', + keys=['question'], + ), + dict(type='PackInputs', algorithm_keys=['question', 'gt_answer']), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# schedule settings +optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05)) + +param_scheduler = [ + dict( + type='CosineAnnealingLR', + by_epoch=True, + begin=0, + end=10, + ) +] + +train_cfg = dict(max_epochs=10) +val_cfg = dict() +test_cfg = dict() diff --git a/configs/blip2/blip2-opt2.7b_8xb16_vqa.py b/configs/blip2/blip2-opt2.7b_8xb16_vqa.py new file mode 100644 index 0000000000000000000000000000000000000000..13a808dc224454642392142f9f6598f42e717b64 --- /dev/null +++ b/configs/blip2/blip2-opt2.7b_8xb16_vqa.py @@ -0,0 +1,95 @@ +_base_ = [ + '../_base_/datasets/coco_vqa.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='Blip2VQA', + tokenizer=dict( + type='AutoTokenizer', name_or_path='facebook/opt-2.7b', + use_fast=False), + vision_backbone=dict( + type='BEiTViT', + # eva-g without the final layer + arch=dict( + embed_dims=1408, + num_layers=39, + num_heads=16, + feedforward_channels=6144, + ), + img_size=364, + patch_size=14, + out_indices=-2, + layer_scale_init_value=0.0, + use_abs_pos_emb=True, + use_rel_pos_bias=False, + frozen_stages=39, + final_norm=False, + use_shared_rel_pos_bias=False, + out_type='raw'), + text_backbone=dict( +
type='OPTForCausalLM', name_or_path='facebook/opt-2.7b'), + multimodal_backbone=dict( + type='Qformer', + model_style='bert-base-uncased', + vision_model_width=1408, + add_cross_attention=True, + cross_attention_freq=2, + num_query_token=32), + vision_neck=dict( + type='LinearClsHead', + in_channels=768, + num_classes=2560, + ), + prompt='Question: {} Answer:', + max_txt_len=10) + +# data settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=['question_id', 'image_id'], + ), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(224, 224), + interpolation='bicubic', + backend='pillow'), + dict( + type='CleanCaption', + keys=['question'], + ), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'], + meta_keys=['question_id', 'image_id'], + ), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# schedule settings +optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05)) + +param_scheduler = [ + dict( + type='CosineAnnealingLR', + by_epoch=True, + begin=0, + end=10, + ) +] + +train_cfg = dict(max_epochs=10) +val_cfg = dict() +test_cfg = dict() diff --git a/configs/blip2/blip2-opt2.7b_8xb32_caption.py b/configs/blip2/blip2-opt2.7b_8xb32_caption.py new file mode 100644 index 0000000000000000000000000000000000000000..52d0a63223ffdaf69730dffc2a6d4212765255a6 --- /dev/null +++ b/configs/blip2/blip2-opt2.7b_8xb32_caption.py @@ -0,0 +1,76 @@ +_base_ = [ + '../_base_/datasets/coco_caption.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='Blip2Caption', + tokenizer=dict( + type='AutoTokenizer', name_or_path='facebook/opt-2.7b', + use_fast=False), + vision_backbone=dict( + type='BEiTViT', + # eva-g without the final layer + arch=dict( + embed_dims=1408, + num_layers=39, + num_heads=16, + feedforward_channels=6144, + ), + img_size=364, + patch_size=14, + out_indices=-2, + layer_scale_init_value=0.0, + use_abs_pos_emb=True, + use_rel_pos_bias=False, + frozen_stages=39, + final_norm=False, + use_shared_rel_pos_bias=False, + out_type='raw'), + text_backbone=dict( + type='OPTForCausalLM', name_or_path='facebook/opt-2.7b'), + multimodal_backbone=dict( + type='Qformer', + model_style='bert-base-uncased', + vision_model_width=1408, + add_cross_attention=True, + cross_attention_freq=2, + num_query_token=32), + vision_neck=dict( + type='LinearClsHead', + in_channels=768, + num_classes=2560, + ), + prompt='a photo of', + max_txt_len=30) + +# schedule settings +optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05)) + +param_scheduler = [ + dict( + type='CosineAnnealingLR', + by_epoch=True, + begin=0, + end=10, + ) +] + +train_cfg = dict(by_epoch=True, max_epochs=10) +val_cfg = dict() +test_cfg = dict() + +# dataset settings +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(364, 364), + interpolation='bicubic', + backend='pillow'), + dict(type='PackInputs', meta_keys=['image_id']), +] + +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader diff --git a/configs/blip2/blip2_8xb32_retrieval.py b/configs/blip2/blip2_8xb32_retrieval.py new file mode 100644 index 
0000000000000000000000000000000000000000..75cb66cbfd53ac5e4e53928a65eb8617f00fb4af --- /dev/null +++ b/configs/blip2/blip2_8xb32_retrieval.py @@ -0,0 +1,82 @@ +_base_ = [ + '../_base_/datasets/coco_retrieval.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='Blip2Retrieval', + tokenizer=dict(type='Blip2Tokenizer', name_or_path='bert-base-uncased'), + vision_backbone=dict( + type='BEiTViT', + # eva-g without the final layer + arch=dict( + embed_dims=1408, + num_layers=39, + num_heads=16, + feedforward_channels=6144, + ), + img_size=364, + patch_size=14, + layer_scale_init_value=0.0, + use_abs_pos_emb=True, + use_rel_pos_bias=False, + final_norm=False, + use_shared_rel_pos_bias=False, + out_type='raw'), + multimodal_backbone=dict( + type='Qformer', + model_style='bert-base-uncased', + vision_model_width=1408, + add_cross_attention=True, + cross_attention_freq=2, + num_query_token=32), + vision_neck=dict( + type='LinearClsHead', + in_channels=768, + num_classes=256, + ), + text_neck=dict( + type='LinearClsHead', + in_channels=768, + num_classes=256, + ), + multimodal_head=dict( + type='ITMHead', + hidden_size=768, + with_pooler=False, + ), + topk=128, + max_txt_len=35, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(364, 364), + interpolation='bicubic', + backend='pillow'), + dict(type='CleanCaption', keys='text'), + dict( + type='PackInputs', + algorithm_keys=['text', 'gt_text_id', 'gt_image_id'], + meta_keys=['image_id']), +] + +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# optimizer +optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.04) +optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer) + +# learning rate scheduler +param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=6) +val_cfg = dict(type='RetrievalValLoop') +test_cfg = dict(type='RetrievalTestLoop') + +randomness = dict(seed=42) diff --git a/configs/blip2/metafile.yml b/configs/blip2/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..b822103a21fa0b1b350ffbc6c5fdd6fb8ad4e8e2 --- /dev/null +++ b/configs/blip2/metafile.yml @@ -0,0 +1,71 @@ +Collections: + - Name: BLIP-2 + Metadata: + Training Data: + - COCO + - VG + - CC3M + - CC12M + - SBU + - LAION-400M + Training Resources: 8x A100 GPUs + Architecture: + - Transformer + - Q-Former + Paper: + Title: 'BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image + Encoders and Large Language Models' + URL: https://arxiv.org/abs/2301.12597 + README: configs/blip2/README.md + +Models: + - Name: blip2_3rdparty_retrieval + Metadata: + FLOPs: null + Parameters: 1173191358 + In Collection: BLIP-2 + Results: + - Task: Image-To-Text Retrieval + Dataset: COCO + Metrics: + Recall@1: 85.4 + - Task: Text-To-Image Retrieval + Dataset: COCO + Metrics: + Recall@1: 68.3 + Weights: https://download.openmmlab.com/mmclassification/v1/blip2/blip2_3rdparty_pretrain_20230505-f7ef4390.pth + Config: configs/blip2/blip2_8xb32_retrieval.py + Converted From: + Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_opt2.7b.pth + Code: https://github.com/salesforce/LAVIS + - Name: blip2-opt2.7b_3rdparty-zeroshot_vqa + Metadata: + FLOPs: null + Parameters: 3770465152 + In Collection: BLIP-2 + Results: + - Task: Visual Question Answering + Dataset: VQAv2 + Metrics: + Accuracy: 53.5 + Weights: 
https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth + Config: configs/blip2/blip2-opt2.7b_8xb16_vqa.py + Converted From: + Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_opt2.7b.pth + Code: https://github.com/salesforce/LAVIS + - Name: blip2-opt2.7b_3rdparty-zeroshot_caption + Metadata: + FLOPs: null + Parameters: 3770465152 + In Collection: BLIP-2 + Results: + - Task: Image Caption + Dataset: COCO + Metrics: + BLEU-4: 32.90 + CIDER: 111.10 + Weights: https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth + Config: configs/blip2/blip2-opt2.7b_8xb32_caption.py + Converted From: + Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_opt2.7b.pth + Code: https://github.com/salesforce/LAVIS diff --git a/configs/byol/README.md b/configs/byol/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2bfc8d064159ecfddaf2b2a4d0dca302b55e5f1f --- /dev/null +++ b/configs/byol/README.md @@ -0,0 +1,85 @@ +# BYOL + +> [Bootstrap your own latent: A new approach to self-supervised Learning](https://arxiv.org/abs/2006.07733) + + + +## Abstract + +**B**ootstrap **Y**our **O**wn **L**atent (BYOL) is a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. + +
+ +
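+The abstract boils down to two update rules: the online network (plus a predictor) is trained to match the target network's projection of a different augmented view, and the target network is moved towards the online network by an exponential moving average. The snippet below is a toy sketch of those two rules with stand-in encoders; it is not the mmpretrain `BYOL` class, and reading the `base_momentum=0.01` in the config below as the complement of a momentum near 1 (i.e. 0.99) is an assumption.
+
+```python
+# Toy BYOL update (stand-in networks, not the mmpretrain implementation).
+import copy
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+def projector(in_dim, hidden=4096, out_dim=256):
+    return nn.Sequential(nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden),
+                         nn.ReLU(inplace=True), nn.Linear(hidden, out_dim))
+
+
+backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))  # stand-in encoder
+online = nn.Sequential(backbone, projector(512))   # online encoder + projector
+predictor = projector(256)                         # predictor exists only on the online side
+target = copy.deepcopy(online)                     # target network, updated only by EMA
+for p in target.parameters():
+    p.requires_grad = False
+
+
+def byol_loss(p, z):
+    # negative cosine similarity between prediction and stopped-gradient target
+    return 2 - 2 * F.cosine_similarity(p, z.detach(), dim=-1).mean()
+
+
+view1, view2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)  # two augmented views
+loss = (byol_loss(predictor(online(view1)), target(view2)) +
+        byol_loss(predictor(online(view2)), target(view1)))
+loss.backward()
+
+# "Slow-moving average" update of the target network.
+momentum = 0.99
+with torch.no_grad():
+    for p_t, p_o in zip(target.parameters(), online.parameters()):
+        p_t.mul_(momentum).add_(p_o, alpha=1 - momentum)
+```
+
+The target branch receives no gradients; it only trails the online network through the EMA step, providing a slowly moving regression target as described in the abstract.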
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('resnet50_byol-pre_8xb512-linear-coslr-90e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('byol_resnet50_16xb256-coslr-200e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-7596c6f5.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :-------------------------------------- | :--------: | :-------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: | +| `byol_resnet50_16xb256-coslr-200e_in1k` | 68.02 | 4.11 | [config](byol_resnet50_16xb256-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `resnet50_byol-pre_8xb512-linear-coslr-90e_in1k` | [BYOL](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth) | 25.56 | 4.11 | 71.80 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-7596c6f5.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-7596c6f5.json) | + +## Citation + +```bibtex +@inproceedings{grill2020bootstrap, + title={Bootstrap your own latent: A new approach to self-supervised learning}, + author={Grill, Jean-Bastien and Strub, Florian and Altch{\'e}, Florent and Tallec, Corentin and Richemond, Pierre H and Buchatskaya, Elena and Doersch, Carl and Pires, Bernardo Avila and Guo, Zhaohan Daniel and Azar, Mohammad Gheshlaghi and others}, + booktitle={NeurIPS}, + year={2020} +} +``` diff --git a/configs/byol/benchmarks/mask-rcnn_r50-c4_ms-1x_coco.py b/configs/byol/benchmarks/mask-rcnn_r50-c4_ms-1x_coco.py new file mode 100644 index 
0000000000000000000000000000000000000000..4949db16a922737c5809b2c07519a6bb6867d165 --- /dev/null +++ b/configs/byol/benchmarks/mask-rcnn_r50-c4_ms-1x_coco.py @@ -0,0 +1,46 @@ +_base_ = 'mmdet::mask_rcnn/mask-rcnn_r50-caffe-c4_1x_coco.py' +# https://github.com/open-mmlab/mmdetection/blob/dev-3.x/configs/mask_rcnn/mask-rcnn_r50-caffe-c4_1x_coco.py + +data_preprocessor = dict( + type='DetDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + bgr_to_rgb=True, + pad_mask=True, + pad_size_divisor=32) + +norm_cfg = dict(type='SyncBN', requires_grad=True) +model = dict( + data_preprocessor=data_preprocessor, + backbone=dict( + frozen_stages=-1, + norm_cfg=norm_cfg, + norm_eval=False, + style='pytorch', + init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')), + roi_head=dict( + shared_head=dict( + type='ResLayerExtraNorm', + norm_cfg=norm_cfg, + norm_eval=False, + style='pytorch'))) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='LoadAnnotations', with_bbox=True, with_mask=True), + dict( + type='RandomChoiceResize', + scales=[(1333, 640), (1333, 672), (1333, 704), (1333, 736), + (1333, 768), (1333, 800)], + keep_ratio=True), + dict(type='RandomFlip', prob=0.5), + dict(type='PackDetInputs') +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=12, val_interval=1) + +custom_imports = dict( + imports=['mmpretrain.models.utils.res_layer_extra_norm'], + allow_failed_imports=False) diff --git a/configs/byol/benchmarks/mask-rcnn_r50_fpn_ms-1x_coco.py b/configs/byol/benchmarks/mask-rcnn_r50_fpn_ms-1x_coco.py new file mode 100644 index 0000000000000000000000000000000000000000..1341f1508bdc400da6e79b47e1a174c0819fc79b --- /dev/null +++ b/configs/byol/benchmarks/mask-rcnn_r50_fpn_ms-1x_coco.py @@ -0,0 +1,24 @@ +_base_ = 'mmdet::mask_rcnn/mask-rcnn_r50_fpn_1x_coco.py' +# https://github.com/open-mmlab/mmdetection/blob/dev-3.x/configs/mask_rcnn/mask-rcnn_r50_fpn_1x_coco.py + +norm_cfg = dict(type='SyncBN', requires_grad=True) +model = dict( + backbone=dict(frozen_stages=-1, norm_cfg=norm_cfg, norm_eval=False), + neck=dict(norm_cfg=norm_cfg), + roi_head=dict( + bbox_head=dict(type='Shared4Conv1FCBBoxHead', norm_cfg=norm_cfg), + mask_head=dict(norm_cfg=norm_cfg))) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='LoadAnnotations', with_bbox=True, with_mask=True), + dict( + type='RandomChoiceResize', + scales=[(1333, 640), (1333, 672), (1333, 704), (1333, 736), + (1333, 768), (1333, 800)], + keep_ratio=True), + dict(type='RandomFlip', prob=0.5), + dict(type='PackDetInputs') +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) diff --git a/configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py b/configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2b5074c082b8b6fb36bd3c6711b60bab6394b4ce --- /dev/null +++ b/configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py @@ -0,0 +1,18 @@ +_base_ = [ + '../../_base_/models/resnet50.py', + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_lars_coslr_90e.py', + '../../_base_/default_runtime.py', +] + +model = dict( + backbone=dict( + frozen_stages=4, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.'))) + +# dataset summary +train_dataloader = dict(batch_size=512) + +# runtime settings +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', 
interval=10, max_keep_ckpts=3)) diff --git a/configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py b/configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..8dd3fd8bee88206f18d79500c401fa1f787d6e7f --- /dev/null +++ b/configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py @@ -0,0 +1,60 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_byol.py', + '../_base_/schedules/imagenet_lars_coslr_200e.py', + '../_base_/default_runtime.py', +] + +train_dataloader = dict(batch_size=256) + +# model settings +model = dict( + type='BYOL', + base_momentum=0.01, + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='SyncBN'), + zero_init_residual=False), + neck=dict( + type='NonLinearNeck', + in_channels=2048, + hid_channels=4096, + out_channels=256, + num_layers=2, + with_bias=True, + with_last_bn=False, + with_avg_pool=True), + head=dict( + type='LatentPredictHead', + predictor=dict( + type='NonLinearNeck', + in_channels=256, + hid_channels=4096, + out_channels=256, + num_layers=2, + with_bias=True, + with_last_bn=False, + with_avg_pool=False), + loss=dict(type='CosineSimilarityLoss')), +) + +# optimizer +optimizer = dict(type='LARS', lr=4.8, momentum=0.9, weight_decay=1e-6) +optim_wrapper = dict( + type='OptimWrapper', + optimizer=optimizer, + paramwise_cfg=dict( + custom_keys={ + 'bn': dict(decay_mult=0, lars_exclude=True), + 'bias': dict(decay_mult=0, lars_exclude=True), + # bn layer in ResNet block downsample module + 'downsample.1': dict(decay_mult=0, lars_exclude=True), + }), +) + +# runtime settings +default_hooks = dict(checkpoint=dict(max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/byol/metafile.yml b/configs/byol/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..09aacad1c580ec4ec4abe08e60dffd30eba540a8 --- /dev/null +++ b/configs/byol/metafile.yml @@ -0,0 +1,44 @@ +Collections: + - Name: BYOL + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - LARS + Training Resources: 8x V100 GPUs (b256), 16x A100-80G GPUs (b4096) + Architecture: + - ResNet + - BYOL + Paper: + Title: 'Bootstrap your own latent: A new approach to self-supervised Learning' + URL: https://arxiv.org/abs/2006.07733 + README: configs/byol/README.md + +Models: + - Name: byol_resnet50_16xb256-coslr-200e_in1k + Metadata: + Epochs: 200 + Batch Size: 4096 + FLOPs: 4109364224 + Parameters: 68024448 + Training Data: ImageNet-1k + In Collection: BYOL + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth + Config: configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py + Downstream: + - resnet50_byol-pre_8xb512-linear-coslr-90e_in1k + - Name: resnet50_byol-pre_8xb512-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 4096 + FLOPs: 4109464576 + Parameters: 25557032 + Training Data: ImageNet-1k + In Collection: BYOL + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 71.8 + Weights: https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-7596c6f5.pth + Config: configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py diff --git a/configs/cae/README.md b/configs/cae/README.md new file mode 
100644 index 0000000000000000000000000000000000000000..dc1c818d71c35300930c4a11b5e2ed52b995cd0e --- /dev/null +++ b/configs/cae/README.md @@ -0,0 +1,86 @@ +# CAE + +> [Context Autoencoder for Self-Supervised Representation Learning](https://arxiv.org/abs/2202.03026) + + + +## Abstract + +We present a novel masked image modeling (MIM) approach, context autoencoder (CAE), for self-supervised learning. We randomly partition the image into two sets: visible patches and masked patches. The CAE architecture consists of: (i) an encoder that takes visible patches as input and outputs their latent representations, (ii) a latent context regressor that predicts the masked patch representations from the visible patch representations that are not updated in this regressor, (iii) a decoder that takes the estimated masked patch representations as input and makes predictions for the masked patches, and (iv) an alignment module that aligns the masked patch representation estimation with the masked patch representations computed from the encoder. In comparison to previous MIM methods that couple the encoding and decoding roles, e.g., using a single module in BEiT, our approach attempts to separate the encoding role (content understanding) from the decoding role (making predictions for masked patches) using different modules, improving the content understanding capability. In addition, our approach makes predictions from the visible patches to the masked patches in the latent representation space that is expected to take on semantics. In addition, we present the explanations about why contrastive pretraining and supervised pretraining perform similarly and why MIM potentially performs better. We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks: semantic segmentation, and object detection and instance segmentation. + +
+ +
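+The four components listed in the abstract (encoder, latent context regressor, decoder, alignment module) can be easier to follow as code. Below is a deliberately simplified, self-contained sketch of the data flow, not the mmpretrain `CAE` model: the regressor here is plain self-attention over concatenated visible latents and mask queries, the prediction targets are random token ids standing in for the DALL-E visual tokens used by the config below, and treating `lambd=2` as the weight of the alignment term is an assumption.
+
+```python
+# Toy CAE-style forward pass (a simplified sketch, not the mmpretrain `CAE` model).
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+dim, num_patches, num_masked, vocab = 192, 196, 75, 8192
+
+
+def make_block():
+    return nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=512, batch_first=True)
+
+
+encoder = nn.TransformerEncoder(make_block(), num_layers=2)    # (i) sees visible patches only
+regressor = nn.TransformerEncoder(make_block(), num_layers=1)  # (ii) predicts masked latents
+decoder = nn.Linear(dim, vocab)                                # (iii) predicts a target id per masked patch
+mask_token = nn.Parameter(torch.zeros(1, 1, dim))
+
+patches = torch.randn(2, num_patches, dim)                     # already-embedded image patches
+perm = torch.randperm(num_patches)
+vis_idx, mask_idx = perm[num_masked:], perm[:num_masked]
+
+z_vis = encoder(patches[:, vis_idx])
+queries = mask_token.expand(2, num_masked, -1)
+z_pred = regressor(torch.cat([z_vis, queries], dim=1))[:, -num_masked:]
+
+with torch.no_grad():                                          # (iv) alignment target from the encoder itself
+    z_target = encoder(patches[:, mask_idx])
+align_loss = F.mse_loss(z_pred, z_target)
+
+logits = decoder(z_pred)
+target_ids = torch.randint(0, vocab, (2, num_masked))          # stand-in for DALL-E token targets
+recon_loss = F.cross_entropy(logits.reshape(-1, vocab), target_ids.reshape(-1))
+loss = recon_loss + 2.0 * align_loss                           # alignment weight, cf. `lambd=2` (assumed role)
+print(float(loss))
+```
+
+The separation the abstract argues for is visible in the structure: only the encoder is responsible for content understanding (it never sees mask tokens), while the regressor and decoder handle the prediction task in the latent space.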
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('beit-base-p16_cae-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('cae_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_16xb128-fp16-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k_20220825-f3d234cd.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :--------------------------------------------- | :--------: | :-------: | :-------------------------------------------------------: | :----------------------------------------------------------------------------: | +| `cae_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 288.43 | 17.58 | [config](cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221230-808170f3.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221230-808170f3.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `beit-base-p16_cae-pre_8xb128-coslr-100e_in1k` | [CAE](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221230-808170f3.pth) | 86.68 | 17.58 | 83.20 | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_16xb128-fp16-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k_20220825-f3d234cd.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_16xb128-fp16-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k_20220825-f3d234cd.json) | + +## Citation + +```bibtex +@article{CAE, + title={Context Autoencoder for Self-Supervised Representation Learning}, + author={Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, + Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, Jingdong Wang}, + journal={ArXiv}, + year={2022} +} +``` diff --git a/configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py b/configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py new file mode 100644 index 
0000000000000000000000000000000000000000..e7083ce80a8311220fe6ebd5b6024c195887aa57 --- /dev/null +++ b/configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py @@ -0,0 +1,130 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] +# CAE fine-tuning setting + +# dataset +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] +train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline), batch_size=128) + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='BEiTViT', + arch='base', + img_size=224, + patch_size=16, + final_norm=False, # do not use final norm + drop_path_rate=0.1, + layer_scale_init_value=0.1, + out_type='avg_featmap', + use_abs_pos_emb=True, + use_rel_pos_bias=True, + use_shared_rel_pos_bias=False, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=dict(type='TruncNormal', layer='Linear', std=2e-5)), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +# optimizer wrapper +optim_wrapper = dict( + optimizer=dict( + type='AdamW', lr=8e-3, betas=(0.9, 0.999), weight_decay=0.05), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.65, + custom_keys={ + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=95, + by_epoch=True, + begin=5, + end=100, + eta_min=1e-6, + convert_to_iter_based=True) +] + +default_hooks = dict( + # save checkpoint per epoch. 
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +train_cfg = dict(by_epoch=True, max_epochs=100) + +randomness = dict(seed=0) diff --git a/configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..725b0f07ce71fa0ea98ae7343f0dbf47adda3ebb --- /dev/null +++ b/configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py @@ -0,0 +1,115 @@ +_base_ = '../_base_/default_runtime.py' + +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='TwoNormDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + second_mean=[-31.875, -31.875, -31.875], + second_std=[318.75, 318.75, 318.75], + to_rgb=True) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomFlip', prob=0.5), + dict( + type='RandomResizedCropAndInterpolationWithTwoPic', + size=224, + second_size=112, + interpolation='bicubic', + second_interpolation='lanczos', + scale=(0.08, 1.0)), + dict( + type='BEiTMaskGenerator', + input_size=(14, 14), + num_masking_patches=75, + max_num_patches=None, + min_num_patches=16), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=256, + num_workers=8, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='meta/train.txt', + data_prefix=dict(img_path='train/'), + pipeline=train_pipeline)) + +# model settings +model = dict( + type='CAE', + backbone=dict( + type='CAEPretrainViT', + arch='b', + patch_size=16, + layer_scale_init_value=0.1, + bias='qv_bias'), + neck=dict( + type='CAENeck', + embed_dims=768, + num_heads=12, + regressor_depth=4, + decoder_depth=4, + mlp_ratio=4, + layer_scale_init_value=0.1, + ), + head=dict(type='CAEHead', loss=dict(type='CAELoss', lambd=2)), + target_generator=dict( + type='DALL-E', + init_cfg=dict( + type='Pretrained', + checkpoint= # noqa: E251 + 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/dalle_encoder.pth', # noqa: E501 + )), + base_momentum=0.0) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05), + clip_grad=dict(max_norm=3.0), + paramwise_cfg=dict( + bias_decay_mult=0.0, norm_decay_mult=0.0, flat_decay_mult=0.0)) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=290, + eta_min=1e-5, + by_epoch=True, + begin=10, + end=300, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
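+# For example, this config assumes 8 GPUs x 256 images = 2048 total batch
+# size; with a different actual batch size, the optimizer lr is scaled
+# linearly by (actual batch size / base_batch_size) when auto-scale-lr is
+# enabled (e.g. via the --auto-scale-lr option of tools/train.py).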
+auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/cae/metafile.yml b/configs/cae/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..83f46f9f810384979a0f0b4483e9ab518653bcff --- /dev/null +++ b/configs/cae/metafile.yml @@ -0,0 +1,43 @@ +Collections: + - Name: CAE + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - AdamW + Training Resources: 8x A100-80G GPUs + Architecture: + - ViT + Paper: + Title: Context Autoencoder for Self-Supervised Representation Learning + URL: https://arxiv.org/abs/2202.03026 + README: configs/cae/README.md + +Models: + - Name: cae_beit-base-p16_8xb256-amp-coslr-300e_in1k + Metadata: + Epochs: 300 + Batch Size: 2048 + FLOPs: 17581976064 + Parameters: 288429952 + Training Data: ImageNet-1k + In Collection: CAE + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221230-808170f3.pth + Config: configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py + Downstream: + - beit-base-p16_cae-pre_8xb128-coslr-100e_in1k + - Name: beit-base-p16_cae-pre_8xb128-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 1024 + FLOPs: 17581219584 + Parameters: 86682280 + Training Data: ImageNet-1k + In Collection: CAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.2 + Weights: https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_16xb128-fp16-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k_20220825-f3d234cd.pth + Config: configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py diff --git a/configs/chinese_clip/README.md b/configs/chinese_clip/README.md new file mode 100644 index 0000000000000000000000000000000000000000..acb37e7a2adfdf641e07a695ec064cf8507f33ed --- /dev/null +++ b/configs/chinese_clip/README.md @@ -0,0 +1,69 @@ +# ChineseCLIP + +> [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) + + + +## Abstract + +The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, where the model is first trained with the image encoder frozen and then trained with all parameters being optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups of zero-shot learning and finetuning, and it is able to achieve competitive performance in zero-shot image classification based on the evaluation on the ELEVATER benchmark (Li et al., 2022). We have released our codes, models, and demos in https://github.com/OFA-Sys/Chinese-CLIP + +
+ +
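+The two-stage recipe described in the abstract boils down to a freeze/unfreeze pattern. The sketch below is purely illustrative: `train_contrastive` stands in for an ordinary CLIP-style contrastive training loop and is not part of the mmpretrain API.
+
+```python
+def two_stage_pretrain(image_encoder, text_encoder, train_contrastive,
+                       stage1_steps, stage2_steps):
+    """Stage 1: tune the text tower with the (already pretrained) image
+    encoder frozen; stage 2: unfreeze and optimize all parameters."""
+    for p in image_encoder.parameters():
+        p.requires_grad = False
+    train_contrastive(image_encoder, text_encoder, steps=stage1_steps)
+
+    for p in image_encoder.parameters():
+        p.requires_grad = True
+    # rebuild the optimizer inside the training loop after unfreezing, so
+    # that the image-encoder parameters are actually updated in stage 2
+    train_contrastive(image_encoder, text_encoder, steps=stage2_steps)
+```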
+ +## How to use it? + + + +**Use the model for zero-shot classification** + +```python +from mmpretrain import ImageClassificationInferencer + +inferencer = ImageClassificationInferencer( + 'cn-clip_resnet50_zeroshot-cls_cifar100', + pretrained=True, + classes=['鸟', '狗', '猫', '蛇'], + text_prototype=['鸟', '狗', '猫', '蛇'], +) + +prediction = inferencer('./demo/bird.JPEG')[0] +print('Results:', prediction['pred_class']) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_resnet50_3rdparty_20230519-6a2b3eb2.pth +``` + + + +## Models and results + +### Image Classification on CIFAR100 + +| Model | Params (M) | Top-1 (%) | Config | Download | +| :---------------------------------------------- | :--------: | :-------: | :------------------------------------------------------: | :----------------------------------------------------------------------------: | +| `cn-clip_resnet50_zeroshot-cls_cifar100`\* | 77.00 | 40.70 | [config](cn-clip_resnet50_zeroshot-cls_cifar100.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_resnet50_3rdparty_20230519-6a2b3eb2.pth) | +| `cn-clip_vit-base-p16_zeroshot-cls_cifar100`\* | 188.00 | 64.50 | [config](cn-clip_vit-base-p16_zeroshot-cls_cifar100.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-base-p16_3rdparty_20230519-37fbc59e.pth) | +| `cn-clip_vit-large-p14_zeroshot-cls_cifar100`\* | 406.00 | 74.80 | [config](cn-clip_vit-large-p14_zeroshot-cls_cifar100.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-large-p14_3rdparty_20230519-3f844503.pth) | +| `cn-clip_vit-huge-p14_zeroshot-cls_cifar100`\* | 958.00 | 79.10 | [config](cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-huge-p14_3rdparty_20230519-e4f49b00.pth) | + +*Models with * are converted from the [official repo](https://github.com/OFA-Sys/Chinese-CLIP). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{chinese-clip, + title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese}, + author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang}, + journal={arXiv preprint arXiv:2211.01335}, + year={2022} +} +``` diff --git a/configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py b/configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py new file mode 100644 index 0000000000000000000000000000000000000000..e109a5bfbb4442580aa830259a2a29f4ba11a0b5 --- /dev/null +++ b/configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py @@ -0,0 +1,72 @@ +_base_ = '../_base_/default_runtime.py' + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + to_rgb=False, +) + +test_pipeline = [ + dict(type='Resize', scale=(224, 224), interpolation='bicubic'), + dict( + type='PackInputs', + meta_keys=['image_id', 'scale_factor'], + ), +] + +train_dataloader = None +test_dataloader = dict( + batch_size=32, + num_workers=8, + dataset=dict( + type='CIFAR100', + data_root='data/cifar100', + split='test', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = dict(type='Accuracy', topk=(1, )) + +# schedule settings +train_cfg = None +val_cfg = None +test_cfg = dict() + +# model settings +model = dict( + type='ChineseCLIP', + vision_backbone=dict( + type='ModifiedResNet', + depth=50, + base_channels=64, + input_size=224, + num_attn_heads=32, + output_dim=1024, + ), + text_backbone=dict( + type='BertModelCN', + config=dict( + vocab_size=21128, + pad_token_id=0, + add_type_embeddings=True, + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + max_position_embeddings=512, + num_attention_heads=12, + num_hidden_layers=3, + type_vocab_size=2, + layer_norm_eps=1e-12)), + tokenizer=dict( + type='FullTokenizer', + vocab_file= # noqa + 'https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/vocab.txt' + ), + proj_dim=1024, + text_prototype='cifar100', +) diff --git a/configs/chinese_clip/cn-clip_vit-base-p16_zeroshot-cls_cifar100.py b/configs/chinese_clip/cn-clip_vit-base-p16_zeroshot-cls_cifar100.py new file mode 100644 index 0000000000000000000000000000000000000000..1c0ad1c9e39bcbfc615e688d5fc8c2812789989b --- /dev/null +++ b/configs/chinese_clip/cn-clip_vit-base-p16_zeroshot-cls_cifar100.py @@ -0,0 +1,76 @@ +_base_ = '../_base_/default_runtime.py' + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + to_rgb=False, +) + +test_pipeline = [ + dict(type='Resize', scale=(224, 224), interpolation='bicubic'), + dict( + type='PackInputs', + algorithm_keys=['text'], + meta_keys=['image_id', 'scale_factor'], + ), +] + +train_dataloader = None +test_dataloader = dict( + batch_size=32, + num_workers=8, + dataset=dict( + type='CIFAR100', + data_root='data/cifar100', + split='test', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = dict(type='Accuracy', topk=(1, )) + +# schedule settings +train_cfg = None +val_cfg = None +test_cfg = dict() + +# model settings +model = 
dict( + type='ChineseCLIP', + vision_backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + norm_cfg=dict(type='LN', eps=1e-5), + final_norm=True, + layer_cfgs=dict(act_cfg=dict(type='QuickGELU')), + pre_norm=True, + out_type='cls_token', + ), + text_backbone=dict( + type='BertModelCN', + config=dict( + vocab_size=21128, + pad_token_id=0, + add_type_embeddings=True, + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + max_position_embeddings=512, + num_attention_heads=12, + num_hidden_layers=12, + type_vocab_size=2, + layer_norm_eps=1e-12)), + tokenizer=dict( + type='FullTokenizer', + vocab_file= # noqa + 'https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/vocab.txt' + ), + proj_dim=512, + text_prototype='cifar100', +) diff --git a/configs/chinese_clip/cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py b/configs/chinese_clip/cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py new file mode 100644 index 0000000000000000000000000000000000000000..83aae122e8f0d2ec4fd78bb69e94feda09672980 --- /dev/null +++ b/configs/chinese_clip/cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py @@ -0,0 +1,75 @@ +_base_ = '../_base_/default_runtime.py' + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + to_rgb=False, +) + +test_pipeline = [ + dict(type='Resize', scale=(224, 224), interpolation='bicubic'), + dict( + type='PackInputs', + meta_keys=['image_id', 'scale_factor'], + ), +] + +train_dataloader = None +test_dataloader = dict( + batch_size=32, + num_workers=8, + dataset=dict( + type='CIFAR100', + data_root='data/cifar100', + split='test', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = dict(type='Accuracy', topk=(1, )) + +# schedule settings +train_cfg = None +val_cfg = None +test_cfg = dict() + +# model settings +model = dict( + type='ChineseCLIP', + vision_backbone=dict( + type='VisionTransformer', + arch='huge', + img_size=224, + patch_size=14, + norm_cfg=dict(type='LN', eps=1e-5), + final_norm=True, + layer_cfgs=dict(act_cfg=dict(type='QuickGELU')), + pre_norm=True, + out_type='cls_token', + ), + text_backbone=dict( + type='BertModelCN', + config=dict( + vocab_size=21128, + pad_token_id=0, + add_type_embeddings=True, + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=1024, + initializer_range=0.02, + intermediate_size=4096, + max_position_embeddings=512, + num_attention_heads=16, + num_hidden_layers=24, + type_vocab_size=2, + layer_norm_eps=1e-12)), + tokenizer=dict( + type='FullTokenizer', + vocab_file= # noqa + 'https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/vocab.txt' + ), + proj_dim=1024, + text_prototype='cifar100', +) diff --git a/configs/chinese_clip/cn-clip_vit-large-p14_zeroshot-cls_cifar100.py b/configs/chinese_clip/cn-clip_vit-large-p14_zeroshot-cls_cifar100.py new file mode 100644 index 0000000000000000000000000000000000000000..35f0b6fb53fa2bf8d389f4a0f6ea08bdbac72175 --- /dev/null +++ b/configs/chinese_clip/cn-clip_vit-large-p14_zeroshot-cls_cifar100.py @@ -0,0 +1,75 @@ +_base_ = '../_base_/default_runtime.py' + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 
255, 0.27577711 * 255], + to_rgb=False, +) + +test_pipeline = [ + dict(type='Resize', scale=(224, 224), interpolation='bicubic'), + dict( + type='PackInputs', + meta_keys=['image_id', 'scale_factor'], + ), +] + +train_dataloader = None +test_dataloader = dict( + batch_size=32, + num_workers=8, + dataset=dict( + type='CIFAR100', + data_root='data/cifar100', + split='test', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = dict(type='Accuracy', topk=(1, )) + +# schedule settings +train_cfg = None +val_cfg = None +test_cfg = dict() + +# model settings +model = dict( + type='ChineseCLIP', + vision_backbone=dict( + type='VisionTransformer', + arch='large', + img_size=224, + patch_size=14, + norm_cfg=dict(type='LN', eps=1e-5), + final_norm=True, + layer_cfgs=dict(act_cfg=dict(type='QuickGELU')), + pre_norm=True, + out_type='cls_token', + ), + text_backbone=dict( + type='BertModelCN', + config=dict( + vocab_size=21128, + pad_token_id=0, + add_type_embeddings=True, + attention_probs_dropout_prob=0.1, + hidden_act='gelu', + hidden_dropout_prob=0.1, + hidden_size=768, + initializer_range=0.02, + intermediate_size=3072, + max_position_embeddings=512, + num_attention_heads=12, + num_hidden_layers=12, + type_vocab_size=2, + layer_norm_eps=1e-12)), + tokenizer=dict( + type='FullTokenizer', + vocab_file= # noqa + 'https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/vocab.txt' + ), + proj_dim=768, + text_prototype='cifar100', +) diff --git a/configs/chinese_clip/metafile.yml b/configs/chinese_clip/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..40ebb49e001691c4b8adc87a9e1f24d352e41441 --- /dev/null +++ b/configs/chinese_clip/metafile.yml @@ -0,0 +1,79 @@ +Collections: + - Name: ChineseCLIP + Metadata: + Training Data: + - LAION-5B + - WuKong + - VisualGenome + - MSCOCO + Architecture: + - Transformer + Paper: + Title: 'Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese' + URL: https://arxiv.org/abs/2211.01335 + README: configs/chinese_clip/README.md + +Models: + - Name: cn-clip_resnet50_zeroshot-cls_cifar100 + Metadata: + FLOPs: null + Parameters: 77000000 + In Collection: ChineseCLIP + Results: + - Task: Image Classification + Dataset: CIFAR100 + Metrics: + Top 1 Accuracy: 40.7 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_resnet50_3rdparty_20230519-6a2b3eb2.pth + Config: configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py + Converted From: + Weights: https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_rn50.pt + Code: https://github.com/OFA-Sys/Chinese-CLIP + + - Name: cn-clip_vit-base-p16_zeroshot-cls_cifar100 + Metadata: + FLOPs: null + Parameters: 188000000 + In Collection: ChineseCLIP + Results: + - Task: Image Classification + Dataset: CIFAR100 + Metrics: + Top 1 Accuracy: 64.5 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-base-p16_3rdparty_20230519-37fbc59e.pth + Config: configs/chinese_clip/cn-clip_vit-base-p16_zeroshot-cls_cifar100.py + Converted From: + Weights: https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-b-16.pt + Code: https://github.com/OFA-Sys/Chinese-CLIP + + - Name: cn-clip_vit-large-p14_zeroshot-cls_cifar100 + Metadata: + FLOPs: null + Parameters: 406000000 + In Collection: ChineseCLIP + Results: + - Task: Image Classification + Dataset: CIFAR100 + Metrics: + Top 1 Accuracy: 74.8 + Weights: 
https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-large-p14_3rdparty_20230519-3f844503.pth + Config: configs/chinese_clip/cn-clip_vit-large-p14_zeroshot-cls_cifar100.py + Converted From: + Weights: https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-l-14.pt + Code: https://github.com/OFA-Sys/Chinese-CLIP + + - Name: cn-clip_vit-huge-p14_zeroshot-cls_cifar100 + Metadata: + FLOPs: null + Parameters: 958000000 + In Collection: ChineseCLIP + Results: + - Task: Image Classification + Dataset: CIFAR100 + Metrics: + Top 1 Accuracy: 79.1 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-huge-p14_3rdparty_20230519-e4f49b00.pth + Config: configs/chinese_clip/cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py + Converted From: + Weights: https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-h-14.pt + Code: https://github.com/OFA-Sys/Chinese-CLIP diff --git a/configs/clip/README.md b/configs/clip/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7a14be4d8e05fe3ba1c9d51106889b63029964b9 --- /dev/null +++ b/configs/clip/README.md @@ -0,0 +1,90 @@ +# CLIP + +> [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) + + + +## Abstract + +State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL. + +
+ +
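+The pre-training task of "predicting which caption goes with which image" is a symmetric contrastive loss over a batch of matched image-text pairs. The sketch below only illustrates that objective; the temperature value is a common choice rather than the learned CLIP value, and the checkpoints listed on this page are classifiers fine-tuned from CLIP weights (converted from timm), not models trained with this snippet.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def clip_contrastive_loss(image_features, text_features, temperature=0.07):
+    """Each image has to pick its own caption among the N texts in the
+    batch, and vice versa (symmetric cross-entropy over similarities)."""
+    image_features = F.normalize(image_features, dim=-1)
+    text_features = F.normalize(text_features, dim=-1)
+    # N x N cosine-similarity logits, scaled by the temperature
+    logits = image_features @ text_features.t() / temperature
+    targets = torch.arange(logits.size(0), device=logits.device)
+    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
+    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
+    return (loss_i2t + loss_t2i) / 2
+
+
+# toy usage with random features standing in for encoder outputs
+print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
+```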
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/clip/vit-base-p32_pt-64xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k_20221220-b384e830.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------------------------- | :-----------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------------: | :----------------------------------------------: | +| `vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k`\* | CLIP LAION2B ImageNet-12k | 88.22 | 4.36 | 83.06 | 96.49 | [config](vit-base-p32_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k_20221220-b384e830.pth) | +| `vit-base-p32_clip-laion2b-pre_3rdparty_in1k`\* | CLIP LAION2B | 88.22 | 4.36 | 82.46 | 96.12 | [config](vit-base-p32_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-pre_3rdparty_in1k_20221220-194df57f.pth) | +| `vit-base-p32_clip-openai-pre_3rdparty_in1k`\* | CLIP OPENAI | 88.22 | 4.36 | 81.77 | 95.89 | [config](vit-base-p32_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_openai-pre_3rdparty_in1k_20221220-a0182ba9.pth) | +| `vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k-384px`\* | CLIP LAION2B ImageNet-12k | 88.22 | 12.66 | 85.39 | 97.67 | [config](vit-base-p32_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k-384px_20221220-c7757552.pth) | +| `vit-base-p32_clip-openai-in12k-pre_3rdparty_in1k-384px`\* | CLIP OPENAI ImageNet-12k | 88.22 | 12.66 | 85.13 | 97.42 | [config](vit-base-p32_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_openai-in12k-pre_3rdparty_in1k-384px_20221220-dc2e49ea.pth) | +| `vit-base-p16_clip-laion2b-in12k-pre_3rdparty_in1k`\* | CLIP LAION2B ImageNet-12k | 86.57 | 16.86 | 86.02 | 97.76 | [config](vit-base-p16_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-in12k-pre_3rdparty_in1k_20221220-a5e31f8c.pth) | +| `vit-base-p16_clip-laion2b-pre_3rdparty_in1k`\* | CLIP LAION2B | 86.57 | 16.86 | 85.49 | 97.59 | [config](vit-base-p16_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-pre_3rdparty_in1k_20221220-5e24ff58.pth) | +| `vit-base-p16_clip-openai-in12k-pre_3rdparty_in1k`\* | CLIP OPENAI ImageNet-12k | 86.57 | 16.86 | 85.99 | 97.72 | 
[config](vit-base-p16_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-in12k-pre_3rdparty_in1k_20221220-90d930a8.pth) | +| `vit-base-p16_clip-openai-pre_3rdparty_in1k`\* | CLIP OPENAI | 86.57 | 16.86 | 85.30 | 97.50 | [config](vit-base-p16_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-pre_3rdparty_in1k_20221220-c7d9c899.pth) | +| `vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k-448px`\* | CLIP LAION2B ImageNet-12k | 88.22 | 17.20 | 85.76 | 97.63 | [config](vit-base-p32_pt-64xb64_in1k-448px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k-448px_20221220-ca404a7d.pth) | +| `vit-base-p16_clip-laion2b-in12k-pre_3rdparty_in1k-384px`\* | CLIP LAION2B ImageNet-12k | 86.57 | 49.37 | 87.17 | 98.02 | [config](vit-base-p16_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-in12k-pre_3rdparty_in1k-384px_20221220-84ed0cc0.pth) | +| `vit-base-p16_clip-laion2b-pre_3rdparty_in1k-384px`\* | CLIP LAION2B | 86.57 | 49.37 | 86.52 | 97.97 | [config](vit-base-p16_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-pre_3rdparty_in1k-384px_20221220-558ed826.pth) | +| `vit-base-p16_clip-openai-in12k-pre_3rdparty_in1k-384px`\* | CLIP OPENAI ImageNet-12k | 86.57 | 49.37 | 86.87 | 98.05 | [config](vit-base-p16_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-in12k-pre_3rdparty_in1k-384px_20221220-8df86b74.pth) | +| `vit-base-p16_clip-openai-pre_3rdparty_in1k-384px`\* | CLIP OPENAI | 86.57 | 49.37 | 86.25 | 97.90 | [config](vit-base-p16_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-pre_3rdparty_in1k-384px_20221220-eb012e87.pth) | + +*Models with * are converted from the [timm](https://github.com/rwightman/pytorch-image-models). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@InProceedings{pmlr-v139-radford21a, +title = {Learning Transferable Visual Models From Natural Language Supervision}, +author = {Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya}, +booktitle = {Proceedings of the 38th International Conference on Machine Learning}, +year = {2021}, +series = {Proceedings of Machine Learning Research}, +publisher = {PMLR}, +} +``` diff --git a/configs/clip/clip_vit-base-p16_zeroshot-cls_cifar100.py b/configs/clip/clip_vit-base-p16_zeroshot-cls_cifar100.py new file mode 100644 index 0000000000000000000000000000000000000000..dd684a50a319e9e2b4942ce59ae6e20744b2743e --- /dev/null +++ b/configs/clip/clip_vit-base-p16_zeroshot-cls_cifar100.py @@ -0,0 +1,68 @@ +_base_ = '../_base_/default_runtime.py' + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + to_rgb=False, +) + +test_pipeline = [ + dict(type='Resize', scale=(224, 224), interpolation='bicubic'), + dict( + type='PackInputs', + algorithm_keys=['text'], + meta_keys=['image_id', 'scale_factor'], + ), +] + +train_dataloader = None +test_dataloader = dict( + batch_size=32, + num_workers=8, + dataset=dict( + type='CIFAR100', + data_root='data/cifar100', + split='test', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# schedule settings +train_cfg = None +val_cfg = None +test_cfg = dict() + +# model settings +model = dict( + type='CLIPZeroShot', + vision_backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + drop_rate=0., + layer_cfgs=dict(act_cfg=dict(type='QuickGELU')), + pre_norm=True, + ), + projection=dict(type='CLIPProjection', in_channels=768, out_channels=512), + text_backbone=dict( + type='CLIPTransformer', + width=512, + layers=12, + heads=8, + attn_mask=True, + ), + tokenizer=dict( + type='AutoTokenizer', + name_or_path='openai/clip-vit-base-patch16', + use_fast=False), + vocab_size=49408, + transformer_width=512, + proj_dim=512, + text_prototype='cifar100', + text_prompt='openai_cifar100', + context_length=77, +) diff --git a/configs/clip/clip_vit-base-p16_zeroshot-cls_in1k.py b/configs/clip/clip_vit-base-p16_zeroshot-cls_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..80c4fde82f514c96d9f171d6b3ed57fdbccd923a --- /dev/null +++ b/configs/clip/clip_vit-base-p16_zeroshot-cls_in1k.py @@ -0,0 +1,69 @@ +_base_ = '../_base_/default_runtime.py' + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=(224, 224), interpolation='bicubic'), + dict( + type='PackInputs', + algorithm_keys=['text'], + meta_keys=['image_id', 'scale_factor'], + ), +] + +train_dataloader = None +test_dataloader = dict( + batch_size=32, + num_workers=8, + dataset=dict( + type='ImageNet', + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = 
dict(type='Accuracy', topk=(1, 5)) + +# schedule settings +train_cfg = None +val_cfg = None +test_cfg = dict() + +# model settings +model = dict( + type='CLIPZeroShot', + vision_backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + drop_rate=0., + layer_cfgs=dict(act_cfg=dict(type='QuickGELU')), + pre_norm=True, + ), + projection=dict(type='CLIPProjection', in_channels=768, out_channels=512), + text_backbone=dict( + type='CLIPTransformer', + width=512, + layers=12, + heads=8, + attn_mask=True, + ), + tokenizer=dict( + type='AutoTokenizer', + name_or_path='openai/clip-vit-base-patch16', + use_fast=False), + vocab_size=49408, + transformer_width=512, + proj_dim=512, + text_prototype='imagenet', + text_prompt='openai_imagenet_sub', # openai_imagenet, openai_imagenet_sub + context_length=77, +) diff --git a/configs/clip/clip_vit-large-p14_zeroshot-cls_cifar100.py b/configs/clip/clip_vit-large-p14_zeroshot-cls_cifar100.py new file mode 100644 index 0000000000000000000000000000000000000000..a6dd7c1141211914c9e9835b73d0ee84a46ea3b6 --- /dev/null +++ b/configs/clip/clip_vit-large-p14_zeroshot-cls_cifar100.py @@ -0,0 +1,68 @@ +_base_ = '../_base_/default_runtime.py' + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + to_rgb=False, +) + +test_pipeline = [ + dict(type='Resize', scale=(224, 224), interpolation='bicubic'), + dict( + type='PackInputs', + algorithm_keys=['text'], + meta_keys=['image_id', 'scale_factor'], + ), +] + +train_dataloader = None +test_dataloader = dict( + batch_size=32, + num_workers=8, + dataset=dict( + type='CIFAR100', + data_root='data/cifar100', + split='test', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# schedule settings +train_cfg = None +val_cfg = None +test_cfg = dict() + +# model settings +model = dict( + type='CLIPZeroShot', + vision_backbone=dict( + type='VisionTransformer', + arch='large', + img_size=224, + patch_size=14, + drop_rate=0., + layer_cfgs=dict(act_cfg=dict(type='QuickGELU')), + pre_norm=True, + ), + projection=dict(type='CLIPProjection', in_channels=1024, out_channels=768), + text_backbone=dict( + type='CLIPTransformer', + width=768, + layers=12, + heads=12, + attn_mask=True, + ), + tokenizer=dict( + type='AutoTokenizer', + name_or_path='openai/clip-vit-large-patch14', + use_fast=False), + vocab_size=49408, + transformer_width=768, + proj_dim=768, + text_prototype='cifar100', + text_prompt='openai_cifar100', + context_length=77, +) diff --git a/configs/clip/clip_vit-large-p14_zeroshot-cls_in1k.py b/configs/clip/clip_vit-large-p14_zeroshot-cls_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..10500017a9300e7c2cf8082e575378f346888c3d --- /dev/null +++ b/configs/clip/clip_vit-large-p14_zeroshot-cls_in1k.py @@ -0,0 +1,69 @@ +_base_ = '../_base_/default_runtime.py' + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=(224, 224), interpolation='bicubic'), + dict( + type='PackInputs', + algorithm_keys=['text'], + meta_keys=['image_id', 'scale_factor'], + ), +] + +train_dataloader = None 
+test_dataloader = dict( + batch_size=32, + num_workers=8, + dataset=dict( + type='ImageNet', + data_root='data/imagenet', + split='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = dict(type='Accuracy', topk=(1, 5)) + +# schedule settings +train_cfg = None +val_cfg = None +test_cfg = dict() + +# model settings +model = dict( + type='CLIPZeroShot', + vision_backbone=dict( + type='VisionTransformer', + arch='large', + img_size=224, + patch_size=14, + drop_rate=0., + layer_cfgs=dict(act_cfg=dict(type='QuickGELU')), + pre_norm=True, + ), + projection=dict(type='CLIPProjection', in_channels=1024, out_channels=768), + text_backbone=dict( + type='CLIPTransformer', + width=768, + layers=12, + heads=12, + attn_mask=True, + ), + tokenizer=dict( + type='AutoTokenizer', + name_or_path='openai/clip-vit-large-patch14', + use_fast=False), + vocab_size=49408, + transformer_width=768, + proj_dim=768, + text_prototype='imagenet', + text_prompt='openai_imagenet_sub', # openai_imagenet, openai_imagenet_sub + context_length=77, +) diff --git a/configs/clip/metafile.yml b/configs/clip/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..a82eea49aa0815cf94ac9324ffaea445f815a473 --- /dev/null +++ b/configs/clip/metafile.yml @@ -0,0 +1,308 @@ +Collections: + - Name: CLIP + Metadata: + Architecture: + - Attention Dropout + - Convolution + - Dense Connections + - Dropout + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + - Tanh Activation + Paper: + Title: Learning Transferable Visual Models From Natural Language Supervision + URL: https://arxiv.org/abs/2103.00020 + README: configs/clip/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/vision_transformer.py + Version: v1.0.0 + +Models: + - Name: vit-base-p32_clip-openai-pre_3rdparty_in1k + Metadata: + FLOPs: 4364335104 + Parameters: 88225000 + Training Data: + - OpenAI + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.77 + Top 5 Accuracy: 95.89 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_openai-pre_3rdparty_in1k_20221220-a0182ba9.pth + Config: configs/clip/vit-base-p32_pt-64xb64_in1k.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch32_clip_224.openai_ft_in1k + - Name: vit-base-p32_clip-laion2b-pre_3rdparty_in1k + Metadata: + FLOPs: 4364335104 + Parameters: 88225000 + Training Data: + - LAION-2B + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.46 + Top 5 Accuracy: 96.12 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-pre_3rdparty_in1k_20221220-194df57f.pth + Config: configs/clip/vit-base-p32_pt-64xb64_in1k.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch32_clip_224.laion2b_ft_in1k + - Name: vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k + Metadata: + FLOPs: 4364335104 + Parameters: 88225000 + Training Data: + - LAION-2B + - ImageNet-12k + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.06 + Top 5 Accuracy: 96.49 + Task: Image Classification + Weights: 
https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k_20221220-b384e830.pth + Config: configs/clip/vit-base-p32_pt-64xb64_in1k.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch32_clip_224.laion2b_ft_in12k_in1k + - Name: vit-base-p32_clip-openai-in12k-pre_3rdparty_in1k-384px + Metadata: + FLOPs: 12661054464 + Parameters: 88225000 + Training Data: + - OpenAI + - ImageNet-12k + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.13 + Top 5 Accuracy: 97.42 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_openai-in12k-pre_3rdparty_in1k-384px_20221220-dc2e49ea.pth + Config: configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch32_clip_384.openai_ft_in12k_in1k + - Name: vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k-384px + Metadata: + FLOPs: 12661054464 + Parameters: 88225000 + Training Data: + - LAION-2B + - ImageNet-12k + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.39 + Top 5 Accuracy: 97.67 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k-384px_20221220-c7757552.pth + Config: configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch32_clip_384.laion2b_ft_in12k_in1k + - Name: vit-base-p16_clip-openai-pre_3rdparty_in1k + Metadata: + FLOPs: 16855600128 + Parameters: 86568424 + Training Data: + - OpenAI + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.3 + Top 5 Accuracy: 97.5 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-pre_3rdparty_in1k_20221220-c7d9c899.pth + Config: configs/clip/vit-base-p16_pt-64xb64_in1k.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch16_clip_224.openai_ft_in1k + - Name: vit-base-p16_clip-laion2b-pre_3rdparty_in1k + Metadata: + FLOPs: 16855600128 + Parameters: 86568424 + Training Data: + - LAION-2B + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.49 + Top 5 Accuracy: 97.59 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-pre_3rdparty_in1k_20221220-5e24ff58.pth + Config: configs/clip/vit-base-p16_pt-64xb64_in1k.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch16_clip_224.laion2b_ft_in1k + - Name: vit-base-p16_clip-openai-in12k-pre_3rdparty_in1k + Metadata: + FLOPs: 16855600128 + Parameters: 86568424 + Training Data: + - OpenAI + - ImageNet-12k + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.99 + Top 5 Accuracy: 97.72 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-in12k-pre_3rdparty_in1k_20221220-90d930a8.pth + Config: configs/clip/vit-base-p16_pt-64xb64_in1k.py + Converted From: 
+ Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch16_clip_224.openai_ft_in12k_in1k + - Name: vit-base-p16_clip-laion2b-in12k-pre_3rdparty_in1k + Metadata: + FLOPs: 16855600128 + Parameters: 86568424 + Training Data: + - LAION-2B + - ImageNet-12k + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.02 + Top 5 Accuracy: 97.76 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-in12k-pre_3rdparty_in1k_20221220-a5e31f8c.pth + Config: configs/clip/vit-base-p16_pt-64xb64_in1k.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch16_clip_224.laion2b_ft_in12k_in1k + - Name: vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k-448px + Metadata: + FLOPs: 17202416640 + Parameters: 88225000 + Training Data: + - LAION-2B + - ImageNet-12k + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.76 + Top 5 Accuracy: 97.63 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k-448px_20221220-ca404a7d.pth + Config: configs/clip/vit-base-p32_pt-64xb64_in1k-448px.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch32_clip_448.laion2b_ft_in12k_in1k + - Name: vit-base-p16_clip-openai-pre_3rdparty_in1k-384px + Metadata: + FLOPs: 49370078208 + Parameters: 86568424 + Training Data: + - OpenAI + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.25 + Top 5 Accuracy: 97.9 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-pre_3rdparty_in1k-384px_20221220-eb012e87.pth + Config: configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch16_clip_384.openai_ft_in1k + - Name: vit-base-p16_clip-laion2b-pre_3rdparty_in1k-384px + Metadata: + FLOPs: 49370078208 + Parameters: 86568424 + Training Data: + - LAION-2B + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.52 + Top 5 Accuracy: 97.97 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-pre_3rdparty_in1k-384px_20221220-558ed826.pth + Config: configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch16_clip_384.laion2b_ft_in1k + - Name: vit-base-p16_clip-openai-in12k-pre_3rdparty_in1k-384px + Metadata: + FLOPs: 49370078208 + Parameters: 86568424 + Training Data: + - OpenAI + - ImageNet-12k + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.87 + Top 5 Accuracy: 98.05 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-in12k-pre_3rdparty_in1k-384px_20221220-8df86b74.pth + Config: configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch16_clip_384.openai_ft_in12k_in1k + - Name: 
vit-base-p16_clip-laion2b-in12k-pre_3rdparty_in1k-384px + Metadata: + FLOPs: 49370078208 + Parameters: 86568424 + Training Data: + - LAION-2B + - ImageNet-12k + - ImageNet-1k + In Collection: CLIP + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 87.17 + Top 5 Accuracy: 98.02 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-in12k-pre_3rdparty_in1k-384px_20221220-84ed0cc0.pth + Config: configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py + Converted From: + Code: https://github.com/rwightman/pytorch-image-models + Weights: https://huggingface.co/timm/vit_base_patch16_clip_384.laion2b_ft_in12k_in1k + - Name: vit-large-p14_clip-openai-pre_3rdparty + Metadata: + FLOPs: 59696580608 + Parameters: 303302656 + Training Data: + - OpenAI + In Collection: CLIP + Weights: https://download.openmmlab.com/mmclassification/v0/clip/vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth + Config: configs/clip/vit-large-p14_headless.py + Converted From: + Code: https://github.com/mlfoundations/open_clip + Weights: https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt diff --git a/configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py b/configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..14046ce3e40cce46944ccc0ddef6c884c38d9c89 --- /dev/null +++ b/configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py @@ -0,0 +1,40 @@ +_base_ = [ + '../_base_/models/vit-base-p16.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict(backbone=dict(pre_norm=True)) + +# data settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=384, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=384), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/clip/vit-base-p16_pt-64xb64_in1k-448px.py b/configs/clip/vit-base-p16_pt-64xb64_in1k-448px.py new file mode 100644 index 0000000000000000000000000000000000000000..02af585753074f3a831188a01085917eb04dad4b --- /dev/null +++ b/configs/clip/vit-base-p16_pt-64xb64_in1k-448px.py @@ -0,0 +1,40 @@ +_base_ = [ + '../_base_/models/vit-base-p16.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict(backbone=dict(pre_norm=True)) + +# data settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=448, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=448, + edge='short', + backend='pillow', + interpolation='bicubic'), + 
dict(type='CenterCrop', crop_size=448), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/clip/vit-base-p16_pt-64xb64_in1k.py b/configs/clip/vit-base-p16_pt-64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..cd018bac622744bdcf6cd50821612a9148c4a85d --- /dev/null +++ b/configs/clip/vit-base-p16_pt-64xb64_in1k.py @@ -0,0 +1,40 @@ +_base_ = [ + '../_base_/models/vit-base-p16.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict(backbone=dict(pre_norm=True)) + +# data settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=224, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py b/configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..d1acf78ab6bf335cc0e3cd1012fbe7773336c61e --- /dev/null +++ b/configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py @@ -0,0 +1,40 @@ +_base_ = [ + '../_base_/models/vit-base-p32.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict(backbone=dict(pre_norm=True)) + +# data settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=384, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=384, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=384), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/clip/vit-base-p32_pt-64xb64_in1k-448px.py b/configs/clip/vit-base-p32_pt-64xb64_in1k-448px.py new file mode 100644 index 0000000000000000000000000000000000000000..0f50391f15bb1dc60b94d5ef163f4e88e3b4e509 --- /dev/null +++ b/configs/clip/vit-base-p32_pt-64xb64_in1k-448px.py @@ -0,0 +1,40 @@ +_base_ = [ + '../_base_/models/vit-base-p32.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict(backbone=dict(pre_norm=True)) + +# data settings +train_pipeline = 
[ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=448, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=448, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=448), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/clip/vit-base-p32_pt-64xb64_in1k.py b/configs/clip/vit-base-p32_pt-64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..abbb50089edb9057504e7571bd29fddaa1c53dc9 --- /dev/null +++ b/configs/clip/vit-base-p32_pt-64xb64_in1k.py @@ -0,0 +1,40 @@ +_base_ = [ + '../_base_/models/vit-base-p32.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict(backbone=dict(pre_norm=True)) + +# data settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=224, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/clip/vit-large-p14_headless.py b/configs/clip/vit-large-p14_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..c9b965d4f0edc4794b05a3ea6a917a0d350a27f3 --- /dev/null +++ b/configs/clip/vit-large-p14_headless.py @@ -0,0 +1,34 @@ +_base_ = ['../_base_/default_runtime.py'] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='l', + img_size=224, + patch_size=16, + drop_rate=0.1, + pre_norm=True, + ), +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +test_dataloader = dict( + batch_size=64, + num_workers=5, + dataset=dict( + type='ImageNet', + data_root='data/imagenet', + ann_file='meta/val.txt', + data_prefix='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), +) +test_evaluator = None diff --git a/configs/conformer/README.md b/configs/conformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..04b5d4770b22c346a149dfc0bf7c1dfc2713a2a6 --- /dev/null +++ b/configs/conformer/README.md @@ -0,0 +1,84 @@ +# Conformer + +> [Conformer: Local Features Coupling Global Representations for Visual Recognition](https://arxiv.org/abs/2105.03889) + + + +## Abstract + +Within Convolutional Neural Network (CNN), the convolution operations are good at extracting local features but experience difficulty to 
capture global representations. Within visual transformer, the cascaded self-attention modules can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Conformer roots in the Feature Coupling Unit (FCU), which fuses local features and global representations under different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under the comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAPs for object detection and instance segmentation, respectively, demonstrating the great potential to be a general backbone network. + +
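To make the coupling idea from the abstract concrete, the snippet below is a rough, self-contained PyTorch sketch of how a CNN feature map and a sequence of transformer patch tokens could be fused in both directions (channel alignment with 1x1 convolutions, pooling down to the patch grid, and interpolating back up). It is only an illustration of the concept; the class name `SimpleFeatureCoupling`, the shapes, and the hyper-parameters are assumptions, and it does not reproduce the actual FCU used by the `Conformer` backbone in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleFeatureCoupling(nn.Module):
    """Illustrative sketch of coupling CNN feature maps with transformer tokens.

    Not the FCU implementation from the paper or from mmpretrain; it only
    shows the general idea of aligning channels with 1x1 convolutions and
    matching resolutions by pooling / interpolation before fusing branches.
    Assumes the token sequence has length ``patch_hw * patch_hw``.
    """

    def __init__(self, cnn_channels=256, embed_dims=384, patch_hw=14):
        super().__init__()
        self.patch_hw = patch_hw
        self.cnn_to_token = nn.Conv2d(cnn_channels, embed_dims, kernel_size=1)
        self.token_to_cnn = nn.Conv2d(embed_dims, cnn_channels, kernel_size=1)
        self.token_norm = nn.LayerNorm(embed_dims)
        self.cnn_norm = nn.BatchNorm2d(cnn_channels)

    def forward(self, feat_map, tokens):
        # feat_map: (N, C, H, W) from the CNN branch
        # tokens:   (N, patch_hw*patch_hw, E) patch embeddings from the transformer branch
        n, c, h, w = feat_map.shape

        # CNN -> transformer: align channels, pool to the patch grid, flatten to tokens.
        t = self.cnn_to_token(feat_map)
        t = F.adaptive_avg_pool2d(t, self.patch_hw)
        t = t.flatten(2).transpose(1, 2)
        tokens = tokens + self.token_norm(t)

        # transformer -> CNN: reshape tokens back to a grid, upsample, align channels.
        g = tokens.transpose(1, 2).reshape(n, -1, self.patch_hw, self.patch_hw)
        g = self.token_to_cnn(g)
        g = F.interpolate(g, size=(h, w), mode='bilinear', align_corners=False)
        feat_map = feat_map + self.cnn_norm(g)

        return feat_map, tokens


fcu = SimpleFeatureCoupling()
fm, tk = fcu(torch.rand(2, 256, 56, 56), torch.rand(2, 14 * 14, 384))
print(fm.shape, tk.shape)  # torch.Size([2, 256, 56, 56]) torch.Size([2, 196, 384])
```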
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('conformer-tiny-p16_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('conformer-tiny-p16_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/conformer/conformer-small-p32_8xb128_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/conformer/conformer-tiny-p16_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/conformer/conformer-tiny-p16_3rdparty_8xb128_in1k_20211206-f6860372.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------: | :--------------------------------------------------------------------: | +| `conformer-tiny-p16_3rdparty_in1k`\* | From scratch | 23.52 | 4.90 | 81.31 | 95.60 | [config](conformer-tiny-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-tiny-p16_3rdparty_8xb128_in1k_20211206-f6860372.pth) | +| `conformer-small-p16_3rdparty_in1k`\* | From scratch | 37.67 | 10.31 | 83.32 | 96.46 | [config](conformer-small-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-small-p16_3rdparty_8xb128_in1k_20211206-3065dcf5.pth) | +| `conformer-small-p32_8xb128_in1k` | From scratch | 38.85 | 7.09 | 81.96 | 96.02 | [config](conformer-small-p32_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-small-p32_8xb128_in1k_20211206-947a0816.pth) | +| `conformer-base-p16_3rdparty_in1k`\* | From scratch | 83.29 | 22.89 | 83.82 | 96.59 | [config](conformer-base-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-base-p16_3rdparty_8xb128_in1k_20211206-bfdf8637.pth) | + +*Models with * are converted from the [official repo](https://github.com/pengzhiliang/Conformer/blob/main/models.py#L89). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{peng2021conformer, + title={Conformer: Local Features Coupling Global Representations for Visual Recognition}, + author={Zhiliang Peng and Wei Huang and Shanzhi Gu and Lingxi Xie and Yaowei Wang and Jianbin Jiao and Qixiang Ye}, + journal={arXiv preprint arXiv:2105.03889}, + year={2021}, +} +``` diff --git a/configs/conformer/conformer-base-p16_8xb128_in1k.py b/configs/conformer/conformer-base-p16_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a44f56f3ac3213c616a6e960ce2476466eb65bbd --- /dev/null +++ b/configs/conformer/conformer-base-p16_8xb128_in1k.py @@ -0,0 +1,8 @@ +_base_ = [ + '../_base_/models/conformer/base-p16.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_conformer.py', + '../_base_/default_runtime.py' +] + +train_dataloader = dict(batch_size=128) diff --git a/configs/conformer/conformer-small-p16_8xb128_in1k.py b/configs/conformer/conformer-small-p16_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a937f4f9e60c3987a6ff3d2b7320a0dd49855cbc --- /dev/null +++ b/configs/conformer/conformer-small-p16_8xb128_in1k.py @@ -0,0 +1,8 @@ +_base_ = [ + '../_base_/models/conformer/small-p16.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_conformer.py', + '../_base_/default_runtime.py' +] + +train_dataloader = dict(batch_size=128) diff --git a/configs/conformer/conformer-small-p32_8xb128_in1k.py b/configs/conformer/conformer-small-p32_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0b07ce2ce3fba146675b7a8453cc581f2a011db1 --- /dev/null +++ b/configs/conformer/conformer-small-p32_8xb128_in1k.py @@ -0,0 +1,8 @@ +_base_ = [ + '../_base_/models/conformer/small-p32.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_conformer.py', + '../_base_/default_runtime.py' +] + +train_dataloader = dict(batch_size=128) diff --git a/configs/conformer/conformer-tiny-p16_8xb128_in1k.py b/configs/conformer/conformer-tiny-p16_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f88c6c3b0da3c50e0b3ccb2454b200dfbaf7c4c7 --- /dev/null +++ b/configs/conformer/conformer-tiny-p16_8xb128_in1k.py @@ -0,0 +1,8 @@ +_base_ = [ + '../_base_/models/conformer/tiny-p16.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_conformer.py', + '../_base_/default_runtime.py' +] + +train_dataloader = dict(batch_size=128) diff --git a/configs/conformer/metafile.yml b/configs/conformer/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..c0821bad059c32db978f02c4935a41ec0c054c16 --- /dev/null +++ b/configs/conformer/metafile.yml @@ -0,0 +1,78 @@ +Collections: + - Name: Conformer + Metadata: + Training Data: ImageNet-1k + Architecture: + - Layer Normalization + - Scaled Dot-Product Attention + - Dropout + Paper: + URL: https://arxiv.org/abs/2105.03889 + Title: "Conformer: Local Features Coupling Global Representations for Visual Recognition" + README: configs/conformer/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.19.0/mmcls/models/backbones/conformer.py + Version: v0.19.0 + +Models: + - Name: conformer-tiny-p16_3rdparty_in1k + In Collection: Conformer + Config: configs/conformer/conformer-tiny-p16_8xb128_in1k.py + Metadata: + FLOPs: 4899611328 + Parameters: 23524704 + 
Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.31 + Top 5 Accuracy: 95.60 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/conformer/conformer-tiny-p16_3rdparty_8xb128_in1k_20211206-f6860372.pth + Converted From: + Weights: https://drive.google.com/file/d/19SxGhKcWOR5oQSxNUWUM2MGYiaWMrF1z/view?usp=sharing + Code: https://github.com/pengzhiliang/Conformer/blob/main/models.py#L65 + - Name: conformer-small-p16_3rdparty_in1k + In Collection: Conformer + Config: configs/conformer/conformer-small-p16_8xb128_in1k.py + Metadata: + FLOPs: 10311309312 + Parameters: 37673424 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.32 + Top 5 Accuracy: 96.46 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/conformer/conformer-small-p16_3rdparty_8xb128_in1k_20211206-3065dcf5.pth + Converted From: + Weights: https://drive.google.com/file/d/1mpOlbLaVxOfEwV4-ha78j_1Ebqzj2B83/view?usp=sharing + Code: https://github.com/pengzhiliang/Conformer/blob/main/models.py#L73 + - Name: conformer-small-p32_8xb128_in1k + In Collection: Conformer + Config: configs/conformer/conformer-small-p32_8xb128_in1k.py + Metadata: + FLOPs: 7087281792 + Parameters: 38853072 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.96 + Top 5 Accuracy: 96.02 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/conformer/conformer-small-p32_8xb128_in1k_20211206-947a0816.pth + - Name: conformer-base-p16_3rdparty_in1k + In Collection: Conformer + Config: configs/conformer/conformer-base-p16_8xb128_in1k.py + Metadata: + FLOPs: 22892078080 + Parameters: 83289136 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.82 + Top 5 Accuracy: 96.59 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/conformer/conformer-base-p16_3rdparty_8xb128_in1k_20211206-bfdf8637.pth + Converted From: + Weights: https://drive.google.com/file/d/1oeQ9LSOGKEUaYGu7WTlUGl3KDsQIi0MA/view?usp=sharing + Code: https://github.com/pengzhiliang/Conformer/blob/main/models.py#L89 diff --git a/configs/convmixer/README.md b/configs/convmixer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a87d27ffb8ec0dd6a6182d99133a227b0b29945b --- /dev/null +++ b/configs/convmixer/README.md @@ -0,0 +1,79 @@ +# ConvMixer + +> [Patches Are All You Need?](https://arxiv.org/abs/2201.09792) + + + +## Abstract + +Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. 
In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet. + +
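The mixing design described above is compact enough to sketch directly. The snippet below is a simplified, self-contained PyTorch sketch of a ConvMixer-style model, not the backbone implementation used by the configs in this folder: a convolutional patch embedding, then repeated blocks in which a depthwise convolution mixes spatial locations and a pointwise (1x1) convolution mixes channels, with GELU and BatchNorm throughout. The default hyper-parameters are only examples; `dim=768, depth=32` would roughly correspond to the `convmixer-768-32` entry listed below.

```python
import torch
import torch.nn as nn


class Residual(nn.Module):
    """Wraps a module with a residual connection."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x


def conv_mixer(dim=768, depth=32, kernel_size=7, patch_size=7, num_classes=1000):
    """Minimal ConvMixer-style sketch: patch embedding + depth mixing blocks."""
    return nn.Sequential(
        # Patch embedding: a strided convolution turning p x p patches into dim-d features.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            # Spatial mixing: depthwise convolution with a residual connection.
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding='same'),
                nn.GELU(),
                nn.BatchNorm2d(dim))),
            # Channel mixing: pointwise convolution.
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, num_classes),
    )


# Toy-sized instance for a quick shape check.
model = conv_mixer(dim=256, depth=8)
print(model(torch.rand(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```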
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('convmixer-768-32_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('convmixer-768-32_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/convmixer/convmixer-768-32_10xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-768-32_3rdparty_10xb64_in1k_20220323-bca1f7b8.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :------------------------------------------------------------------------: | +| `convmixer-768-32_3rdparty_in1k`\* | From scratch | 21.11 | 19.62 | 80.16 | 95.08 | [config](convmixer-768-32_10xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-768-32_3rdparty_10xb64_in1k_20220323-bca1f7b8.pth) | +| `convmixer-1024-20_3rdparty_in1k`\* | From scratch | 24.38 | 5.55 | 76.94 | 93.36 | [config](convmixer-1024-20_10xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-1024-20_3rdparty_10xb64_in1k_20220323-48f8aeba.pth) | +| `convmixer-1536-20_3rdparty_in1k`\* | From scratch | 51.63 | 48.71 | 81.37 | 95.61 | [config](convmixer-1536-20_10xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-1536_20_3rdparty_10xb64_in1k_20220323-ea5786f3.pth) | + +*Models with * are converted from the [official repo](https://github.com/locuslab/convmixer). The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@misc{trockman2022patches, + title={Patches Are All You Need?}, + author={Asher Trockman and J. 
Zico Kolter}, + year={2022}, + eprint={2201.09792}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/configs/convmixer/convmixer-1024-20_10xb64_in1k.py b/configs/convmixer/convmixer-1024-20_10xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0dbc664261e2244cb35a779211c45b5b854d4cc5 --- /dev/null +++ b/configs/convmixer/convmixer-1024-20_10xb64_in1k.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/models/convmixer/convmixer-1024-20.py', + '../_base_/datasets/imagenet_bs64_convmixer_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=0.01), + clip_grad=dict(max_norm=5.0), +) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + begin=0, + end=20, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=130, + eta_min=1e-5, + by_epoch=True, + begin=20, + end=150) +] + +train_cfg = dict(by_epoch=True, max_epochs=150) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (10 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=640) diff --git a/configs/convmixer/convmixer-1536-20_10xb64_in1k.py b/configs/convmixer/convmixer-1536-20_10xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..3c8cc95c20312311ee06cee911dc186944de5b7f --- /dev/null +++ b/configs/convmixer/convmixer-1536-20_10xb64_in1k.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/models/convmixer/convmixer-1536-20.py', + '../_base_/datasets/imagenet_bs64_convmixer_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=0.01), + clip_grad=dict(max_norm=5.0), +) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + begin=0, + end=20, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=130, + eta_min=1e-5, + by_epoch=True, + begin=20, + end=150) +] + +train_cfg = dict(by_epoch=True, max_epochs=150) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (10 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=640) diff --git a/configs/convmixer/convmixer-768-32_10xb64_in1k.py b/configs/convmixer/convmixer-768-32_10xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d872d4429134ef8c88ea87da3c93b6532472423e --- /dev/null +++ b/configs/convmixer/convmixer-768-32_10xb64_in1k.py @@ -0,0 +1,19 @@ +_base_ = [ + '../_base_/models/convmixer/convmixer-768-32.py', + '../_base_/datasets/imagenet_bs64_convmixer_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=0.01), + clip_grad=dict(max_norm=5.0), +) + +train_cfg = dict(by_epoch=True, max_epochs=300) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (10 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=640) diff --git a/configs/convmixer/metafile.yml b/configs/convmixer/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..f9dcdc7cc71ddc72791ab47666c0a35d30a9f349 --- /dev/null +++ b/configs/convmixer/metafile.yml @@ -0,0 +1,61 @@ +Collections: + - Name: ConvMixer + Metadata: + Training Data: ImageNet-1k + Architecture: + - 1x1 Convolution + - LayerScale + Paper: + URL: https://arxiv.org/abs/2201.09792 + Title: Patches Are All You Need? + README: configs/convmixer/README.md + +Models: + - Name: convmixer-768-32_3rdparty_in1k + Metadata: + FLOPs: 19623051264 + Parameters: 21110248 + In Collection: ConvMixer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.16 + Top 5 Accuracy: 95.08 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-768-32_3rdparty_10xb64_in1k_20220323-bca1f7b8.pth + Config: configs/convmixer/convmixer-768-32_10xb64_in1k.py + Converted From: + Weights: https://github.com/tmp-iclr/convmixer/releases/download/v1.0/convmixer_768_32_ks7_p7_relu.pth.tar + Code: https://github.com/locuslab/convmixer + - Name: convmixer-1024-20_3rdparty_in1k + Metadata: + FLOPs: 5550112768 + Parameters: 24383464 + In Collection: ConvMixer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 76.94 + Top 5 Accuracy: 93.36 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-1024-20_3rdparty_10xb64_in1k_20220323-48f8aeba.pth + Config: configs/convmixer/convmixer-1024-20_10xb64_in1k.py + Converted From: + Weights: https://github.com/tmp-iclr/convmixer/releases/download/v1.0/convmixer_1024_20_ks9_p14.pth.tar + Code: https://github.com/locuslab/convmixer + - Name: convmixer-1536-20_3rdparty_in1k + Metadata: + FLOPs: 48713170944 + Parameters: 51625960 + In Collection: ConvMixer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.37 + Top 5 Accuracy: 95.61 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-1536_20_3rdparty_10xb64_in1k_20220323-ea5786f3.pth + Config: configs/convmixer/convmixer-1536-20_10xb64_in1k.py + Converted From: + Weights: https://github.com/tmp-iclr/convmixer/releases/download/v1.0/convmixer_1536_20_ks9_p7.pth.tar + Code: https://github.com/locuslab/convmixer diff --git a/configs/convnext/README.md b/configs/convnext/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2e6e14c2f2e65af68c1f8177bdec91f70a0b3149 --- /dev/null +++ b/configs/convnext/README.md @@ -0,0 +1,123 @@ +# ConvNeXt + +> [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545v1) + + + +## Introduction + +**ConvNeXt** is initially described in [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545v1), which is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers. The ConvNeXt has the pyramid structure and achieve competitive performance on various vision tasks, with simplicity and efficiency. + +
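As a rough illustration of the block design this introduction refers to, the snippet below sketches one ConvNeXt-style block in plain PyTorch: a 7x7 depthwise convolution, LayerNorm, an inverted-bottleneck MLP of 1x1 convolutions (written as `nn.Linear` in channels-last layout) with GELU, a learnable per-channel scale, and a residual connection. It is a simplified sketch rather than the backbone implementation used by these configs, and the default `dim=96` is only an example.

```python
import torch
import torch.nn as nn


class ConvNeXtBlockSketch(nn.Module):
    """Simplified sketch of a ConvNeXt-style block (illustration only)."""

    def __init__(self, dim=96, layer_scale_init=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # applied in channels-last layout
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv as a linear layer
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.gamma = nn.Parameter(layer_scale_init * torch.ones(dim))

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # to channels-last for LayerNorm / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = self.gamma * x                       # learnable per-channel scaling
        x = x.permute(0, 3, 1, 2)
        return shortcut + x


block = ConvNeXtBlockSketch(dim=96)
print(block(torch.rand(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```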
+ +## Abstract + +
+ +Show the paper's abstract + +
+The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets. +
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('convnext-tiny_32xb128_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('convnext-tiny_32xb128_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/convnext/convnext-tiny_32xb128_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/convnext/convnext-tiny_32xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :--------------------------------- | :--------: | :-------: | :---------------------------------------: | :--------------------------------------------------------------------------------------------------------: | +| `convnext-base_3rdparty_in21k`\* | 88.59 | 15.36 | [config](convnext-base_32xb128_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_in21k_20220124-13b83eec.pth) | +| `convnext-large_3rdparty_in21k`\* | 197.77 | 34.37 | [config](convnext-large_64xb64_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_in21k_20220124-41b5a79f.pth) | +| `convnext-xlarge_3rdparty_in21k`\* | 350.20 | 60.93 | [config](convnext-xlarge_64xb64_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_3rdparty_in21k_20220124-f909bad7.pth) | + +*Models with * are converted from the [official repo](https://github.com/facebookresearch/ConvNeXt). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------------: | :------------------------------------------------------: | +| `convnext-tiny_32xb128_in1k` | From scratch | 28.59 | 4.46 | 82.14 | 96.06 | [config](convnext-tiny_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.json) | +| `convnext-tiny_32xb128-noema_in1k` | From scratch | 28.59 | 4.46 | 81.95 | 95.89 | [config](convnext-tiny_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128-noema_in1k_20221208-5d4509c7.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.json) | +| `convnext-tiny_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 28.59 | 4.46 | 82.90 | 96.62 | [config](convnext-tiny_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_in21k-pre_3rdparty_in1k_20221219-7501e534.pth) | +| `convnext-tiny_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 28.59 | 13.14 | 84.11 | 97.14 | [config](convnext-tiny_32xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_in21k-pre_3rdparty_in1k-384px_20221219-c1182362.pth) | +| `convnext-small_32xb128_in1k` | From scratch | 50.22 | 8.69 | 83.16 | 96.56 | [config](convnext-small_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128_in1k_20221207-4ab7052c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128_in1k_20221207-4ab7052c.json) | +| `convnext-small_32xb128-noema_in1k` | From scratch | 50.22 | 8.69 | 83.21 | 96.48 | [config](convnext-small_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128-noema_in1k_20221208-4a618995.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128_in1k_20221207-4ab7052c.json) | +| `convnext-small_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 50.22 | 8.69 | 84.59 | 97.41 | [config](convnext-small_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_in21k-pre_3rdparty_in1k_20221219-aeca4c93.pth) | +| `convnext-small_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 50.22 | 25.58 | 85.75 | 97.88 | [config](convnext-small_32xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_in21k-pre_3rdparty_in1k-384px_20221219-96f0bb87.pth) | +| `convnext-base_32xb128_in1k` | From scratch | 88.59 | 15.36 | 83.66 | 96.74 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128_in1k_20221207-fbdb5eb9.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128_in1k_20221207-fbdb5eb9.json) | +| `convnext-base_32xb128-noema_in1k` | From scratch | 88.59 | 15.36 | 83.64 | 96.61 | [config](convnext-base_32xb128_in1k.py) | 
[model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128-noema_in1k_20221208-f8182678.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128_in1k_20221207-fbdb5eb9.json) | +| `convnext-base_3rdparty_in1k`\* | From scratch | 88.59 | 15.36 | 83.85 | 96.74 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128_in1k_20220124-d0915162.pth) | +| `convnext-base_3rdparty-noema_in1k`\* | From scratch | 88.59 | 15.36 | 83.71 | 96.60 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128-noema_in1k_20220222-dba4f95f.pth) | +| `convnext-base_3rdparty_in1k-384px`\* | From scratch | 88.59 | 45.21 | 85.10 | 97.34 | [config](convnext-base_32xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_in1k-384px_20221219-c8f1dc2b.pth) | +| `convnext-base_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 88.59 | 15.36 | 85.81 | 97.86 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_32xb128_in1k_20220124-eb2d6ada.pth) | +| `convnext-base_in21k-pre-3rdparty_in1k-384px`\* | From scratch | 88.59 | 45.21 | 86.82 | 98.25 | [config](convnext-base_32xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_in1k-384px_20221219-4570f792.pth) | +| `convnext-large_3rdparty_in1k`\* | From scratch | 197.77 | 34.37 | 84.30 | 96.89 | [config](convnext-large_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_64xb64_in1k_20220124-f8a0ded0.pth) | +| `convnext-large_3rdparty_in1k-384px`\* | From scratch | 197.77 | 101.10 | 85.50 | 97.59 | [config](convnext-large_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_in1k-384px_20221219-6dd29d10.pth) | +| `convnext-large_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 197.77 | 34.37 | 86.61 | 98.04 | [config](convnext-large_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_in21k-pre-3rdparty_64xb64_in1k_20220124-2412403d.pth) | +| `convnext-large_in21k-pre-3rdparty_in1k-384px`\* | From scratch | 197.77 | 101.10 | 87.46 | 98.37 | [config](convnext-large_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_in21k-pre-3rdparty_in1k-384px_20221219-6d38dd66.pth) | +| `convnext-xlarge_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 350.20 | 60.93 | 86.97 | 98.20 | [config](convnext-xlarge_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_in21k-pre-3rdparty_64xb64_in1k_20220124-76b6863d.pth) | +| `convnext-xlarge_in21k-pre-3rdparty_in1k-384px`\* | From scratch | 350.20 | 179.20 | 87.76 | 98.55 | [config](convnext-xlarge_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_in21k-pre-3rdparty_in1k-384px_20221219-b161bc14.pth) | + +*Models with * are converted from the [official repo](https://github.com/facebookresearch/ConvNeXt). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@Article{liu2022convnet, + author = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie}, + title = {A ConvNet for the 2020s}, + journal = {arXiv preprint arXiv:2201.03545}, + year = {2022}, +} +``` diff --git a/configs/convnext/convnext-base_32xb128_in1k-384px.py b/configs/convnext/convnext-base_32xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..65546942562ac17b3d4510c78d3090aa8b87a831 --- /dev/null +++ b/configs/convnext/convnext-base_32xb128_in1k-384px.py @@ -0,0 +1,23 @@ +_base_ = [ + '../_base_/models/convnext/convnext-base.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=128) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/convnext-base_32xb128_in1k.py b/configs/convnext/convnext-base_32xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..5ae8ec47c4c7ac3f22712c97dbad315c7a798e6f --- /dev/null +++ b/configs/convnext/convnext-base_32xb128_in1k.py @@ -0,0 +1,23 @@ +_base_ = [ + '../_base_/models/convnext/convnext-base.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=128) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=None, +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/convnext-base_32xb128_in21k.py b/configs/convnext/convnext-base_32xb128_in21k.py new file mode 100644 index 0000000000000000000000000000000000000000..c343526c7f084501fc3651c1581752209f5019a4 --- /dev/null +++ b/configs/convnext/convnext-base_32xb128_in21k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/convnext/convnext-base.py', + '../_base_/datasets/imagenet21k_bs128.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# model setting +model = dict(head=dict(num_classes=21841)) + +# dataset setting +data_preprocessor = dict(num_classes=21841) +train_dataloader = dict(batch_size=128) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/convnext-large_64xb64_in1k-384px.py b/configs/convnext/convnext-large_64xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..6698b9edcdae463d6d1cf943237efbaf236cd71c --- /dev/null +++ b/configs/convnext/convnext-large_64xb64_in1k-384px.py @@ -0,0 +1,23 @@ +_base_ = [ + '../_base_/models/convnext/convnext-large.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=64) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (64 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/convnext-large_64xb64_in1k.py b/configs/convnext/convnext-large_64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..8a78c58bc3d85e0e08083d339378886f870388bc --- /dev/null +++ b/configs/convnext/convnext-large_64xb64_in1k.py @@ -0,0 +1,23 @@ +_base_ = [ + '../_base_/models/convnext/convnext-large.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=64) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=None, +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (64 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/convnext-large_64xb64_in21k.py b/configs/convnext/convnext-large_64xb64_in21k.py new file mode 100644 index 0000000000000000000000000000000000000000..420edab67b1dc094f08b4a3810af522b2a988b62 --- /dev/null +++ b/configs/convnext/convnext-large_64xb64_in21k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/convnext/convnext-base.py', + '../_base_/datasets/imagenet21k_bs128.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# model setting +model = dict(head=dict(num_classes=21841)) + +# dataset setting +data_preprocessor = dict(num_classes=21841) +train_dataloader = dict(batch_size=64) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/convnext-small_32xb128_in1k-384px.py b/configs/convnext/convnext-small_32xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..729f00ad2fdf53943ffae9de165e2e9985e733c7 --- /dev/null +++ b/configs/convnext/convnext-small_32xb128_in1k-384px.py @@ -0,0 +1,23 @@ +_base_ = [ + '../_base_/models/convnext/convnext-small.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=128) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/convnext-small_32xb128_in1k.py b/configs/convnext/convnext-small_32xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b623e900f830fbea7891b61c737398c0dee1076e --- /dev/null +++ b/configs/convnext/convnext-small_32xb128_in1k.py @@ -0,0 +1,23 @@ +_base_ = [ + '../_base_/models/convnext/convnext-small.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=128) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=None, +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/convnext-tiny_32xb128_in1k-384px.py b/configs/convnext/convnext-tiny_32xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..6513ad8dfa41714ecb5c9de5992496716337c596 --- /dev/null +++ b/configs/convnext/convnext-tiny_32xb128_in1k-384px.py @@ -0,0 +1,23 @@ +_base_ = [ + '../_base_/models/convnext/convnext-tiny.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=128) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/convnext-tiny_32xb128_in1k.py b/configs/convnext/convnext-tiny_32xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..59d3004bde89510b5c44110c8a6513957c0cbba0 --- /dev/null +++ b/configs/convnext/convnext-tiny_32xb128_in1k.py @@ -0,0 +1,23 @@ +_base_ = [ + '../_base_/models/convnext/convnext-tiny.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=128) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=None, +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/convnext-xlarge_64xb64_in1k-384px.py b/configs/convnext/convnext-xlarge_64xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..6edc94d2448157fc82bf38a988bf4393f192a89f --- /dev/null +++ b/configs/convnext/convnext-xlarge_64xb64_in1k-384px.py @@ -0,0 +1,23 @@ +_base_ = [ + '../_base_/models/convnext/convnext-xlarge.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=64) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (64 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/convnext-xlarge_64xb64_in1k.py b/configs/convnext/convnext-xlarge_64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..528894e808b7085ee66d8be89cf84f860ddec979 --- /dev/null +++ b/configs/convnext/convnext-xlarge_64xb64_in1k.py @@ -0,0 +1,23 @@ +_base_ = [ + '../_base_/models/convnext/convnext-xlarge.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=64) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=None, +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (64 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/convnext-xlarge_64xb64_in21k.py b/configs/convnext/convnext-xlarge_64xb64_in21k.py new file mode 100644 index 0000000000000000000000000000000000000000..420edab67b1dc094f08b4a3810af522b2a988b62 --- /dev/null +++ b/configs/convnext/convnext-xlarge_64xb64_in21k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/convnext/convnext-base.py', + '../_base_/datasets/imagenet21k_bs128.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# model setting +model = dict(head=dict(num_classes=21841)) + +# dataset setting +data_preprocessor = dict(num_classes=21841) +train_dataloader = dict(batch_size=64) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/convnext/metafile.yml b/configs/convnext/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..16896629f07ffadd5313a6e38bc1532ddc3c08f2 --- /dev/null +++ b/configs/convnext/metafile.yml @@ -0,0 +1,410 @@ +Collections: + - Name: ConvNeXt + Metadata: + Training Data: ImageNet-1k + Architecture: + - 1x1 Convolution + - LayerScale + Paper: + URL: https://arxiv.org/abs/2201.03545v1 + Title: A ConvNet for the 2020s + README: configs/convnext/README.md + Code: + Version: v0.20.1 + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/convnext.py + +Models: + - Name: convnext-tiny_32xb128_in1k + Metadata: + FLOPs: 4457472768 + Parameters: 28589128 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.14 + Top 5 Accuracy: 96.06 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.pth + Config: configs/convnext/convnext-tiny_32xb128_in1k.py + - Name: convnext-tiny_32xb128-noema_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 4457472768 + Parameters: 28589128 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.95 + Top 5 Accuracy: 95.89 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128-noema_in1k_20221208-5d4509c7.pth + Config: configs/convnext/convnext-tiny_32xb128_in1k.py + - Name: convnext-tiny_in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 4457472768 + Parameters: 28589128 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.90 + Top 5 Accuracy: 96.62 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_in21k-pre_3rdparty_in1k_20221219-7501e534.pth + Config: configs/convnext/convnext-tiny_32xb128_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_tiny_22k_1k_224.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-tiny_in21k-pre_3rdparty_in1k-384px + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 13135236864 + Parameters: 28589128 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.11 + Top 5 Accuracy: 97.14 + Task: Image Classification + 
Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_in21k-pre_3rdparty_in1k-384px_20221219-c1182362.pth + Config: configs/convnext/convnext-tiny_32xb128_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_tiny_22k_1k_384.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-small_32xb128_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 8687008512 + Parameters: 50223688 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.16 + Top 5 Accuracy: 96.56 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128_in1k_20221207-4ab7052c.pth + Config: configs/convnext/convnext-small_32xb128_in1k.py + - Name: convnext-small_32xb128-noema_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 8687008512 + Parameters: 50223688 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.21 + Top 5 Accuracy: 96.48 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128-noema_in1k_20221208-4a618995.pth + Config: configs/convnext/convnext-small_32xb128_in1k.py + - Name: convnext-small_in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 8687008512 + Parameters: 50223688 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.59 + Top 5 Accuracy: 97.41 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_in21k-pre_3rdparty_in1k_20221219-aeca4c93.pth + Config: configs/convnext/convnext-small_32xb128_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_small_22k_1k_224.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-small_in21k-pre_3rdparty_in1k-384px + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 25580818176 + Parameters: 50223688 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.75 + Top 5 Accuracy: 97.88 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_in21k-pre_3rdparty_in1k-384px_20221219-96f0bb87.pth + Config: configs/convnext/convnext-small_32xb128_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_small_22k_1k_384.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-base_32xb128_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 15359124480 + Parameters: 88591464 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.66 + Top 5 Accuracy: 96.74 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128_in1k_20221207-fbdb5eb9.pth + Config: configs/convnext/convnext-base_32xb128_in1k.py + - Name: convnext-base_32xb128-noema_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 15359124480 + Parameters: 88591464 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.64 + Top 5 Accuracy: 96.61 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128-noema_in1k_20221208-f8182678.pth + Config: configs/convnext/convnext-base_32xb128_in1k.py + - Name: convnext-base_3rdparty_in1k + Metadata: + Training Data: 
ImageNet-1k + FLOPs: 15359124480 + Parameters: 88591464 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.85 + Top 5 Accuracy: 96.74 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128_in1k_20220124-d0915162.pth + Config: configs/convnext/convnext-base_32xb128_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_224_ema.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-base_3rdparty-noema_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 15359124480 + Parameters: 88591464 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.71 + Top 5 Accuracy: 96.60 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128-noema_in1k_20220222-dba4f95f.pth + Config: configs/convnext/convnext-base_32xb128_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_224.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-base_3rdparty_in1k-384px + Metadata: + Training Data: ImageNet-1k + FLOPs: 45205885952 + Parameters: 88591464 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.10 + Top 5 Accuracy: 97.34 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_in1k-384px_20221219-c8f1dc2b.pth + Config: configs/convnext/convnext-base_32xb128_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_384.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-base_3rdparty_in21k + Metadata: + Training Data: ImageNet-21k + FLOPs: 15359124480 + Parameters: 88591464 + In Collection: ConvNeXt + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_in21k_20220124-13b83eec.pth + Config: configs/convnext/convnext-base_32xb128_in21k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_22k_224.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-base_in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 15359124480 + Parameters: 88591464 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.81 + Top 5 Accuracy: 97.86 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_32xb128_in1k_20220124-eb2d6ada.pth + Config: configs/convnext/convnext-base_32xb128_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_22k_1k_224.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-base_in21k-pre-3rdparty_in1k-384px + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 45205885952 + Parameters: 88591464 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.82 + Top 5 Accuracy: 98.25 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_in1k-384px_20221219-4570f792.pth + Config: configs/convnext/convnext-base_32xb128_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_22k_1k_384.pth + Code: 
https://github.com/facebookresearch/ConvNeXt + - Name: convnext-large_3rdparty_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 34368026112 + Parameters: 197767336 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.30 + Top 5 Accuracy: 96.89 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_64xb64_in1k_20220124-f8a0ded0.pth + Config: configs/convnext/convnext-large_64xb64_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_1k_224_ema.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-large_3rdparty_in1k-384px + Metadata: + Training Data: ImageNet-1k + FLOPs: 101103214080 + Parameters: 197767336 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.50 + Top 5 Accuracy: 97.59 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_in1k-384px_20221219-6dd29d10.pth + Config: configs/convnext/convnext-large_64xb64_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_1k_384.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-large_3rdparty_in21k + Metadata: + Training Data: ImageNet-21k + FLOPs: 34368026112 + Parameters: 197767336 + In Collection: ConvNeXt + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_in21k_20220124-41b5a79f.pth + Config: configs/convnext/convnext-large_64xb64_in21k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_22k_224.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-large_in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 34368026112 + Parameters: 197767336 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.61 + Top 5 Accuracy: 98.04 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_in21k-pre-3rdparty_64xb64_in1k_20220124-2412403d.pth + Config: configs/convnext/convnext-large_64xb64_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_22k_1k_224.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-large_in21k-pre-3rdparty_in1k-384px + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 101103214080 + Parameters: 197767336 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 87.46 + Top 5 Accuracy: 98.37 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_in21k-pre-3rdparty_in1k-384px_20221219-6d38dd66.pth + Config: configs/convnext/convnext-large_64xb64_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_22k_1k_384.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-xlarge_3rdparty_in21k + Metadata: + Training Data: ImageNet-21k + FLOPs: 60929820672 + Parameters: 350196968 + In Collection: ConvNeXt + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_3rdparty_in21k_20220124-f909bad7.pth + Config: configs/convnext/convnext-xlarge_64xb64_in21k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_xlarge_22k_224.pth + 
Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-xlarge_in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 60929820672 + Parameters: 350196968 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.97 + Top 5 Accuracy: 98.20 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_in21k-pre-3rdparty_64xb64_in1k_20220124-76b6863d.pth + Config: configs/convnext/convnext-xlarge_64xb64_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_xlarge_22k_1k_224_ema.pth + Code: https://github.com/facebookresearch/ConvNeXt + - Name: convnext-xlarge_in21k-pre-3rdparty_in1k-384px + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 179196798976 + Parameters: 350196968 + In Collection: ConvNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 87.76 + Top 5 Accuracy: 98.55 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_in21k-pre-3rdparty_in1k-384px_20221219-b161bc14.pth + Config: configs/convnext/convnext-xlarge_64xb64_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnext_xlarge_22k_1k_384_ema.pth + Code: https://github.com/facebookresearch/ConvNeXt diff --git a/configs/convnext_v2/README.md b/configs/convnext_v2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e561387412aa3a8e088cb7d015e7b98dba8e50c1 --- /dev/null +++ b/configs/convnext_v2/README.md @@ -0,0 +1,107 @@ +# ConvNeXt V2 + +> [Co-designing and Scaling ConvNets with Masked Autoencoders](http://arxiv.org/abs/2301.00808) + + + +## Abstract + +Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data. + +
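The Global Response Normalization (GRN) layer mentioned in the abstract can be sketched in a few lines. The snippet below is an illustrative channels-last version under the assumption of the formulation summarized above: global L2 aggregation per channel, divisive normalization across channels, then a learnable calibration with an identity path. `GRNSketch` is a hypothetical name and this is not the implementation shipped in this repository.

```python
import torch
import torch.nn as nn


class GRNSketch(nn.Module):
    """Sketch of a GRN layer for channels-last tensors of shape (N, H, W, C)."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        # Global feature aggregation: L2 norm of each channel over the spatial dims.
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)        # (N, 1, 1, C)
        # Divisive normalization: compare each channel against the channel average.
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)      # (N, 1, 1, C)
        # Calibrate the input and keep an identity path.
        return self.gamma * (x * nx) + self.beta + x


grn = GRNSketch(dim=64)
print(grn(torch.rand(2, 56, 56, 64)).shape)  # torch.Size([2, 56, 56, 64])
```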
+ +
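The key architectural change mentioned in the abstract is the Global Response Normalization (GRN) layer. As a rough orientation, the snippet below is a minimal PyTorch sketch of GRN following the three steps described in the paper (global feature aggregation, divisive normalization across channels, feature calibration). The class name `GRN` and the channels-last layout are illustrative assumptions here, not the actual mmpretrain implementation.

```python
import torch
import torch.nn as nn


class GRN(nn.Module):
    """Global Response Normalization over channels-last features (N, H, W, C).

    Minimal sketch following the ConvNeXt V2 paper; not the mmpretrain code.
    """

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        # 1) global feature aggregation: L2 norm over the spatial dims, per channel
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)        # (N, 1, 1, C)
        # 2) feature normalization: divisive normalization across channels
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)     # (N, 1, 1, C)
        # 3) feature calibration with learnable affine and a residual connection
        return self.gamma * (x * nx) + self.beta + x


# quick shape check on a channels-last feature map
y = GRN(dim=80)(torch.rand(2, 56, 56, 80))
print(y.shape)  # torch.Size([2, 56, 56, 80])
```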
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('convnext-v2-atto_fcmae-pre_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('convnext-v2-atto_3rdparty-fcmae_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_fcmae-pre_3rdparty_in1k_20230104-23765f83.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :---------------------------------------- | :--------: | :-------: | :----------------------------------------: | :------------------------------------------------------------------------------------------------: | +| `convnext-v2-atto_3rdparty-fcmae_in1k`\* | 3.71 | 0.55 | [config](convnext-v2-atto_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_3rdparty-fcmae_in1k_20230104-07514db4.pth) | +| `convnext-v2-femto_3rdparty-fcmae_in1k`\* | 5.23 | 0.78 | [config](convnext-v2-femto_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-femto_3rdparty-fcmae_in1k_20230104-adbe2082.pth) | +| `convnext-v2-pico_3rdparty-fcmae_in1k`\* | 9.07 | 1.37 | [config](convnext-v2-pico_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-pico_3rdparty-fcmae_in1k_20230104-147b1b59.pth) | +| `convnext-v2-nano_3rdparty-fcmae_in1k`\* | 15.62 | 2.45 | [config](convnext-v2-nano_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_3rdparty-fcmae_in1k_20230104-3dd1f29e.pth) | +| `convnext-v2-tiny_3rdparty-fcmae_in1k`\* | 28.64 | 4.47 | [config](convnext-v2-tiny_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_3rdparty-fcmae_in1k_20230104-80513adc.pth) | +| `convnext-v2-base_3rdparty-fcmae_in1k`\* | 88.72 | 15.38 | [config](convnext-v2-base_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_3rdparty-fcmae_in1k_20230104-8a798eaf.pth) | +| `convnext-v2-large_3rdparty-fcmae_in1k`\* | 197.96 | 34.40 | [config](convnext-v2-large_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_3rdparty-fcmae_in1k_20230104-bf38df92.pth) | +| `convnext-v2-huge_3rdparty-fcmae_in1k`\* | 660.29 | 115.00 | [config](convnext-v2-huge_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_3rdparty-fcmae_in1k_20230104-fe43ae6c.pth) | + +*Models with * are converted from the [official repo](https://github.com/facebookresearch/ConvNeXt-V2). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------------: | :------------------------------------------------: | +| `convnext-v2-atto_fcmae-pre_3rdparty_in1k`\* | FCMAE | 3.71 | 0.55 | 76.64 | 93.04 | [config](convnext-v2-atto_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_fcmae-pre_3rdparty_in1k_20230104-23765f83.pth) | +| `convnext-v2-femto_fcmae-pre_3rdparty_in1k`\* | FCMAE | 5.23 | 0.78 | 78.48 | 93.98 | [config](convnext-v2-femto_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-femto_fcmae-pre_3rdparty_in1k_20230104-92a75d75.pth) | +| `convnext-v2-pico_fcmae-pre_3rdparty_in1k`\* | FCMAE | 9.07 | 1.37 | 80.31 | 95.08 | [config](convnext-v2-pico_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-pico_fcmae-pre_3rdparty_in1k_20230104-d20263ca.pth) | +| `convnext-v2-nano_fcmae-pre_3rdparty_in1k`\* | FCMAE | 15.62 | 2.45 | 81.86 | 95.75 | [config](convnext-v2-nano_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-pre_3rdparty_in1k_20230104-fe1aaaf2.pth) | +| `convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k`\* | FCMAE ImageNet-21k | 15.62 | 2.45 | 82.04 | 96.16 | [config](convnext-v2-nano_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k_20230104-91fa8ae2.pth) | +| `convnext-v2-tiny_fcmae-pre_3rdparty_in1k`\* | FCMAE | 28.64 | 4.47 | 82.94 | 96.29 | [config](convnext-v2-tiny_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-pre_3rdparty_in1k_20230104-471a86de.pth) | +| `convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k`\* | FCMAE ImageNet-21k | 28.64 | 4.47 | 83.89 | 96.96 | [config](convnext-v2-tiny_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k_20230104-8cc8b8f2.pth) | +| `convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 15.62 | 7.21 | 83.36 | 96.75 | [config](convnext-v2-nano_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-f951ae87.pth) | +| `convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 28.64 | 13.14 | 85.09 | 97.63 | [config](convnext-v2-tiny_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-d8579f84.pth) | +| `convnext-v2-base_fcmae-pre_3rdparty_in1k`\* | FCMAE | 88.72 | 15.38 | 84.87 | 97.08 | [config](convnext-v2-base_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-pre_3rdparty_in1k_20230104-00a70fa4.pth) | +| `convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k`\* | FCMAE ImageNet-21k | 88.72 | 15.38 | 86.74 | 98.02 | [config](convnext-v2-base_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k_20230104-c48d16a5.pth) | +| 
`convnext-v2-large_fcmae-pre_3rdparty_in1k`\* | FCMAE | 197.96 | 34.40 | 85.76 | 97.59 | [config](convnext-v2-large_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-pre_3rdparty_in1k_20230104-ef393013.pth) | +| `convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k`\* | FCMAE ImageNet-21k | 197.96 | 34.40 | 87.26 | 98.24 | [config](convnext-v2-large_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k_20230104-d9c4dc0c.pth) | +| `convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 88.72 | 45.21 | 87.63 | 98.42 | [config](convnext-v2-base_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-379425cc.pth) | +| `convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 197.96 | 101.10 | 88.18 | 98.52 | [config](convnext-v2-large_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-9139a1f3.pth) | +| `convnext-v2-huge_fcmae-pre_3rdparty_in1k`\* | FCMAE | 660.29 | 115.00 | 86.25 | 97.75 | [config](convnext-v2-huge_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-pre_3rdparty_in1k_20230104-f795e5b8.pth) | +| `convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 660.29 | 337.96 | 88.68 | 98.73 | [config](convnext-v2-huge_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-02a4eb35.pth) | +| `convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-512px`\* | FCMAE ImageNet-21k | 660.29 | 600.81 | 88.86 | 98.74 | [config](convnext-v2-huge_32xb32_in1k-512px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-512px_20230104-ce32e63c.pth) | + +*Models with * are converted from the [official repo](https://github.com/facebookresearch/ConvNeXt-V2). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{Woo2023ConvNeXtV2, + title={ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders}, + author={Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon and Saining Xie}, + year={2023}, + journal={arXiv preprint arXiv:2301.00808}, +} +``` diff --git a/configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..68f34c9634e3390bb3c600351ef37e9a94c6d575 --- /dev/null +++ b/configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/convnext_v2/atto.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=8e-4, weight_decay=0.3), + clip_grad=None, +) + +# learning policy +param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/convnext-v2-base_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-base_32xb32_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..70b7f18e0c9dfa92791ff1a8a77553680de673e7 --- /dev/null +++ b/configs/convnext_v2/convnext-v2-base_32xb32_in1k-384px.py @@ -0,0 +1,35 @@ +_base_ = [ + '../_base_/models/convnext_v2/base.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=2.5e-3), + clip_grad=None, +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=20, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/convnext-v2-base_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-base_32xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b66b375eb3a3872842b4fdf72285db36a76dc3b8 --- /dev/null +++ b/configs/convnext_v2/convnext-v2-base_32xb32_in1k.py @@ -0,0 +1,35 @@ +_base_ = [ + '../_base_/models/convnext_v2/base.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=2.5e-3), + clip_grad=None, +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=20, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20) +] + +# train, val, test setting 
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..053e19478fe75dac91b616fa314f4fbdd2667c61 --- /dev/null +++ b/configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/convnext_v2/femto.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=8e-4, weight_decay=0.3), + clip_grad=None, +) + +# learning policy +param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..b734b271ef9a7ada6085c14465a43ee05841b348 --- /dev/null +++ b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-384px.py @@ -0,0 +1,35 @@ +_base_ = [ + '../_base_/models/convnext_v2/huge.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=2.5e-3), + clip_grad=None, +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=20, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-512px.py b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-512px.py new file mode 100644 index 0000000000000000000000000000000000000000..7c63b023be3cbcca94e0847ed88febfd1b099223 --- /dev/null +++ b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-512px.py @@ -0,0 +1,54 @@ +_base_ = [ + '../_base_/models/convnext_v2/huge.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=512, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=512, backend='pillow', interpolation='bicubic'), + dict(type='PackInputs'), +] + +train_dataloader = dict(batch_size=32, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule setting +optim_wrapper = dict( + 
optimizer=dict(lr=2.5e-3), + clip_grad=None, +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=20, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..18621f3aeb86c1a8ad620d71625c2952ca145320 --- /dev/null +++ b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py @@ -0,0 +1,35 @@ +_base_ = [ + '../_base_/models/convnext_v2/huge.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=2.5e-3), + clip_grad=None, +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=20, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/convnext-v2-large_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-large_32xb32_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..b08b12eb0507b2582fe237b498c97f57452e29ec --- /dev/null +++ b/configs/convnext_v2/convnext-v2-large_32xb32_in1k-384px.py @@ -0,0 +1,35 @@ +_base_ = [ + '../_base_/models/convnext_v2/large.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=2.5e-3), + clip_grad=None, +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=20, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/convnext-v2-large_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-large_32xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e9695d08e9c63bae6f440a427c07ddb68b08403b --- /dev/null +++ b/configs/convnext_v2/convnext-v2-large_32xb32_in1k.py @@ -0,0 +1,35 @@ +_base_ = [ + '../_base_/models/convnext_v2/large.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + 
optimizer=dict(lr=2.5e-3), + clip_grad=None, +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=20, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/convnext-v2-nano_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-nano_32xb32_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..a9b36dc59229e0dba661211c3570771453f54113 --- /dev/null +++ b/configs/convnext_v2/convnext-v2-nano_32xb32_in1k-384px.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/convnext_v2/nano.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=8e-4, weight_decay=0.3), + clip_grad=None, +) + +# learning policy +param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9a7c9e3e629522b42b9ff4d02a479b4688a74b92 --- /dev/null +++ b/configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/convnext_v2/nano.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=8e-4, weight_decay=0.3), + clip_grad=None, +) + +# learning policy +param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e2cc52ff252972724d4d6737dda1e784abc4d536 --- /dev/null +++ b/configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/convnext_v2/pico.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=8e-4, weight_decay=0.3), + clip_grad=None, +) + +# learning policy +param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git 
a/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..a19fd6cc670c33726187d41cef41ff33e69d8edd --- /dev/null +++ b/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k-384px.py @@ -0,0 +1,35 @@ +_base_ = [ + '../_base_/models/convnext_v2/tiny.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=3.2e-3), + clip_grad=None, +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=40, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=40) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..c6fbd0f2cd4189fb1699959cf8d63228a1ab3515 --- /dev/null +++ b/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py @@ -0,0 +1,35 @@ +_base_ = [ + '../_base_/models/convnext_v2/tiny.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=3.2e-3), + clip_grad=None, +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=40, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=40) +] + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')] diff --git a/configs/convnext_v2/metafile.yml b/configs/convnext_v2/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..86baa586ec6824603351cc70348c219f68fa71a2 --- /dev/null +++ b/configs/convnext_v2/metafile.yml @@ -0,0 +1,433 @@ +Collections: + - Name: ConvNeXt V2 + Metadata: + Architecture: + - Global Response Normalization + Paper: + Title: Co-designing and Scaling ConvNets with Masked Autoencoders + URL: http://arxiv.org/abs/2301.00808 + README: configs/convnext_v2/README.md + +Models: + - Name: convnext-v2-atto_3rdparty-fcmae_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 551718080 + Parameters: 3708400 + In Collection: ConvNeXt V2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_3rdparty-fcmae_in1k_20230104-07514db4.pth + Config: configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_atto_1k_224_fcmae.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-atto_fcmae-pre_3rdparty_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 551718080 + 
Parameters: 3708400 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 76.64 + Top 5 Accuracy: 93.04 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_fcmae-pre_3rdparty_in1k_20230104-23765f83.pth + Config: configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_atto_1k_224_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-femto_3rdparty-fcmae_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 784892544 + Parameters: 5233240 + In Collection: ConvNeXt V2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-femto_3rdparty-fcmae_in1k_20230104-adbe2082.pth + Config: configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_femto_1k_224_fcmae.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-femto_fcmae-pre_3rdparty_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 784892544 + Parameters: 5233240 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.48 + Top 5 Accuracy: 93.98 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-femto_fcmae-pre_3rdparty_in1k_20230104-92a75d75.pth + Config: configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_femto_1k_224_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-pico_3rdparty-fcmae_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 1374072320 + Parameters: 9066280 + In Collection: ConvNeXt V2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-pico_3rdparty-fcmae_in1k_20230104-147b1b59.pth + Config: configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_pico_1k_224_fcmae.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-pico_fcmae-pre_3rdparty_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 1374072320 + Parameters: 9066280 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.31 + Top 5 Accuracy: 95.08 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-pico_fcmae-pre_3rdparty_in1k_20230104-d20263ca.pth + Config: configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_pico_1k_224_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-nano_3rdparty-fcmae_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 2454926720 + Parameters: 15623800 + In Collection: ConvNeXt V2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_3rdparty-fcmae_in1k_20230104-3dd1f29e.pth + Config: configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_nano_1k_224_fcmae.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-nano_fcmae-pre_3rdparty_in1k + 
Metadata: + Training Data: ImageNet-1k + FLOPs: 2454926720 + Parameters: 15623800 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.86 + Top 5 Accuracy: 95.75 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-pre_3rdparty_in1k_20230104-fe1aaaf2.pth + Config: configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_nano_1k_224_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 2454926720 + Parameters: 15623800 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.04 + Top 5 Accuracy: 96.16 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k_20230104-91fa8ae2.pth + Config: configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_nano_22k_224_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-tiny_3rdparty-fcmae_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 4469631744 + Parameters: 28635496 + In Collection: ConvNeXt V2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_3rdparty-fcmae_in1k_20230104-80513adc.pth + Config: configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_tiny_1k_224_fcmae.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-tiny_fcmae-pre_3rdparty_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 4469631744 + Parameters: 28635496 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.94 + Top 5 Accuracy: 96.29 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-pre_3rdparty_in1k_20230104-471a86de.pth + Config: configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_tiny_1k_224_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 4469631744 + Parameters: 28635496 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.89 + Top 5 Accuracy: 96.96 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k_20230104-8cc8b8f2.pth + Config: configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_tiny_22k_224_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k-384px + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 7214472320 + Parameters: 15623800 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.36 + Top 5 Accuracy: 96.75 + Task: Image Classification + Weights: 
https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-f951ae87.pth + Config: configs/convnext_v2/convnext-v2-nano_32xb32_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_nano_22k_384_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k-384px + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 13135236864 + Parameters: 28635496 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.09 + Top 5 Accuracy: 97.63 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-d8579f84.pth + Config: configs/convnext_v2/convnext-v2-tiny_32xb32_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_tiny_22k_384_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-base_3rdparty-fcmae_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 15382561792 + Parameters: 88717800 + In Collection: ConvNeXt V2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_3rdparty-fcmae_in1k_20230104-8a798eaf.pth + Config: configs/convnext_v2/convnext-v2-base_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_base_1k_224_fcmae.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-base_fcmae-pre_3rdparty_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 15382561792 + Parameters: 88717800 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.87 + Top 5 Accuracy: 97.08 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-pre_3rdparty_in1k_20230104-00a70fa4.pth + Config: configs/convnext_v2/convnext-v2-base_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_base_1k_224_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 15382561792 + Parameters: 88717800 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.74 + Top 5 Accuracy: 98.02 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k_20230104-c48d16a5.pth + Config: configs/convnext_v2/convnext-v2-base_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_base_22k_224_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-large_3rdparty-fcmae_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 34403182080 + Parameters: 197956840 + In Collection: ConvNeXt V2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_3rdparty-fcmae_in1k_20230104-bf38df92.pth + Config: configs/convnext_v2/convnext-v2-large_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_large_1k_224_fcmae.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: 
convnext-v2-large_fcmae-pre_3rdparty_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 34403182080 + Parameters: 197956840 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.76 + Top 5 Accuracy: 97.59 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-pre_3rdparty_in1k_20230104-ef393013.pth + Config: configs/convnext_v2/convnext-v2-large_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_large_1k_224_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 34403182080 + Parameters: 197956840 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 87.26 + Top 5 Accuracy: 98.24 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k_20230104-d9c4dc0c.pth + Config: configs/convnext_v2/convnext-v2-large_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_large_22k_224_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k-384px + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 45205885952 + Parameters: 88717800 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 87.63 + Top 5 Accuracy: 98.42 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-379425cc.pth + Config: configs/convnext_v2/convnext-v2-base_32xb32_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_base_22k_384_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k-384px + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 101103214080 + Parameters: 197956840 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 88.18 + Top 5 Accuracy: 98.52 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-9139a1f3.pth + Config: configs/convnext_v2/convnext-v2-large_32xb32_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_large_22k_384_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-huge_3rdparty-fcmae_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 114998639360 + Parameters: 660289640 + In Collection: ConvNeXt V2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_3rdparty-fcmae_in1k_20230104-fe43ae6c.pth + Config: configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_huge_1k_224_fcmae.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-huge_fcmae-pre_3rdparty_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 114998639360 + Parameters: 660289640 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 
Accuracy: 86.25 + Top 5 Accuracy: 97.75 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-pre_3rdparty_in1k_20230104-f795e5b8.pth + Config: configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_huge_1k_224_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-384px + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 337955157760 + Parameters: 660289640 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 88.68 + Top 5 Accuracy: 98.73 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-02a4eb35.pth + Config: configs/convnext_v2/convnext-v2-huge_32xb32_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_huge_22k_384_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 + - Name: convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-512px + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 600809158400 + Parameters: 660289640 + In Collection: ConvNeXt V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 88.86 + Top 5 Accuracy: 98.74 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-512px_20230104-ce32e63c.pth + Config: configs/convnext_v2/convnext-v2-huge_32xb32_in1k-512px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_huge_22k_512_ema.pt + Code: https://github.com/facebookresearch/ConvNeXt-V2 diff --git a/configs/cspnet/README.md b/configs/cspnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f3b145ba0399b660d03233d9deb11913fbc3c438 --- /dev/null +++ b/configs/cspnet/README.md @@ -0,0 +1,78 @@ +# CSPNet + +> [CSPNet: A New Backbone that can Enhance Learning Capability of CNN](https://arxiv.org/abs/1911.11929) + + + +## Abstract + +Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from appreciating the advanced technology. In this paper, we propose Cross Stage Partial Network (CSPNet) to mitigate the problem that previous works require heavy inference computations from the network architecture perspective. We attribute the problem to the duplicate gradient information within network optimization. The proposed networks respect the variability of the gradients by integrating feature maps from the beginning and the end of a network stage, which, in our experiments, reduces computations by 20% with equivalent or even superior accuracy on the ImageNet dataset, and significantly outperforms state-of-the-art approaches in terms of AP50 on the MS COCO object detection dataset. The CSPNet is easy to implement and general enough to cope with architectures based on ResNet, ResNeXt, and DenseNet. Source code is at this https URL. + +
+ +
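The cross-stage partial idea described in the abstract amounts to splitting a stage's input along the channel dimension, sending only one part through the stage's blocks, and re-fusing both parts at the end of the stage. The sketch below illustrates that wiring with assumed block and channel choices; `CSPStage` and its plain conv blocks are hypothetical names for illustration only, while the real `CSPDarkNet`/`CSPResNet`/`CSPResNeXt` backbones used by these configs live in mmpretrain's `cspnet.py`.

```python
import torch
import torch.nn as nn


class CSPStage(nn.Module):
    """Simplified cross-stage partial stage: split channels, process one part,
    then fuse both parts. Illustrative only; not mmpretrain's CSPNet code."""

    def __init__(self, channels, num_blocks=2):
        super().__init__()
        half = channels // 2
        # the "main" path runs the stage's blocks on half of the channels
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(half, half, 3, padding=1, bias=False),
                nn.BatchNorm2d(half),
                nn.ReLU(inplace=True),
            ) for _ in range(num_blocks)
        ])
        # transition after concatenating the processed and bypassed halves
        self.transition = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        shortcut, main = x.chunk(2, dim=1)   # cross-stage split along channels
        main = self.blocks(main)             # only this part is processed
        return self.transition(torch.cat([shortcut, main], dim=1))


# quick shape check
out = CSPStage(64)(torch.rand(1, 64, 56, 56))
print(out.shape)  # torch.Size([1, 64, 56, 56])
```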
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('cspdarknet50_3rdparty_8xb32_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('cspdarknet50_3rdparty_8xb32_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/cspnet/cspdarknet50_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/cspnet/cspdarknet50_3rdparty_8xb32_in1k_20220329-bd275287.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :-----------------------------------------------------------------------------: | +| `cspdarknet50_3rdparty_8xb32_in1k`\* | From scratch | 27.64 | 5.04 | 80.05 | 95.07 | [config](cspdarknet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/cspnet/cspdarknet50_3rdparty_8xb32_in1k_20220329-bd275287.pth) | +| `cspresnet50_3rdparty_8xb32_in1k`\* | From scratch | 21.62 | 3.48 | 79.55 | 94.68 | [config](cspresnet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/cspnet/cspresnet50_3rdparty_8xb32_in1k_20220329-dd6dddfb.pth) | +| `cspresnext50_3rdparty_8xb32_in1k`\* | From scratch | 20.57 | 3.11 | 79.96 | 94.96 | [config](cspresnext50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/cspnet/cspresnext50_3rdparty_8xb32_in1k_20220329-2cc84d21.pth) | + +*Models with * are converted from the [official repo](https://github.com/rwightman/pytorch-image-models). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@inproceedings{wang2020cspnet, + title={CSPNet: A new backbone that can enhance learning capability of CNN}, + author={Wang, Chien-Yao and Liao, Hong-Yuan Mark and Wu, Yueh-Hua and Chen, Ping-Yang and Hsieh, Jun-Wei and Yeh, I-Hau}, + booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops}, + pages={390--391}, + year={2020} +} +``` diff --git a/configs/cspnet/cspdarknet50_8xb32_in1k.py b/configs/cspnet/cspdarknet50_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..851148109e72202cd5eca721fb66023ab2934e90 --- /dev/null +++ b/configs/cspnet/cspdarknet50_8xb32_in1k.py @@ -0,0 +1,45 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='CSPDarkNet', depth=53), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=288, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=256), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/cspnet/cspresnet50_8xb32_in1k.py b/configs/cspnet/cspresnet50_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d149637aabae7b8cdf691262796becc4cfcc5efc --- /dev/null +++ b/configs/cspnet/cspresnet50_8xb32_in1k.py @@ -0,0 +1,45 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='CSPResNet', depth=50), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=288, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=256), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/cspnet/cspresnext50_8xb32_in1k.py b/configs/cspnet/cspresnext50_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..1f8c15c12f6ab42349eda2a3680f07eabb855448 --- /dev/null +++ b/configs/cspnet/cspresnext50_8xb32_in1k.py @@ -0,0 +1,45 @@ +_base_ = [ + 
'../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='CSPResNeXt', depth=50), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/cspnet/metafile.yml b/configs/cspnet/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..31036325f6e9c96c574a303f60990e28fe7822b9 --- /dev/null +++ b/configs/cspnet/metafile.yml @@ -0,0 +1,64 @@ +Collections: + - Name: CSPNet + Metadata: + Training Data: ImageNet-1k + Architecture: + - Cross Stage Partia Stage + Paper: + URL: https://arxiv.org/abs/1911.11929 + Title: 'CSPNet: A New Backbone that can Enhance Learning Capability of CNN' + README: configs/cspnet/README.md + Code: + Version: v0.22.0 + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.22.0/mmcls/models/backbones/cspnet.py + +Models: + - Name: cspdarknet50_3rdparty_8xb32_in1k + Metadata: + FLOPs: 5040000000 + Parameters: 27640000 + In Collection: CSPNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.05 + Top 5 Accuracy: 95.07 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/cspnet/cspdarknet50_3rdparty_8xb32_in1k_20220329-bd275287.pth + Config: configs/cspnet/cspdarknet50_8xb32_in1k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/cspdarknet53_ra_256-d05c7c21.pth + Code: https://github.com/rwightman/pytorch-image-models + - Name: cspresnet50_3rdparty_8xb32_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 3480000000 + Parameters: 21620000 + In Collection: CSPNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.55 + Top 5 Accuracy: 94.68 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/cspnet/cspresnet50_3rdparty_8xb32_in1k_20220329-dd6dddfb.pth + Config: configs/cspnet/cspresnet50_8xb32_in1k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/cspresnet50_ra-d3e8d487.pth + Code: https://github.com/rwightman/pytorch-image-models + - Name: cspresnext50_3rdparty_8xb32_in1k + Metadata: + FLOPs: 3110000000 + Parameters: 20570000 + In Collection: CSPNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.96 + Top 5 Accuracy: 94.96 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/cspnet/cspresnext50_3rdparty_8xb32_in1k_20220329-2cc84d21.pth + Config: configs/cspnet/cspresnext50_8xb32_in1k.py + Converted From: + Weights: 
https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/cspresnext50_ra_224-648b4713.pth + Code: https://github.com/rwightman/pytorch-image-models diff --git a/configs/csra/README.md b/configs/csra/README.md new file mode 100644 index 0000000000000000000000000000000000000000..99b29571c9e602d501518c0fdfcd490cee83f183 --- /dev/null +++ b/configs/csra/README.md @@ -0,0 +1,73 @@ +# CSRA + +> [Residual Attention: A Simple but Effective Method for Multi-Label Recognition](https://arxiv.org/abs/2108.02456) + + + +## Abstract + +Multi-label image recognition is a challenging computer vision task of practical use. Progresses in this area, however, are often characterized by complicated methods, heavy computations, and lack of intuitive explanations. To effectively capture different spatial regions occupied by objects from different categories, we propose an embarrassingly simple module, named class-specific residual attention (CSRA). CSRA generates class-specific features for every category by proposing a simple spatial attention score, and then combines it with the class-agnostic average pooling feature. CSRA achieves state-of-the-art results on multilabel recognition, and at the same time is much simpler than them. Furthermore, with only 4 lines of code, CSRA also leads to consistent improvement across many diverse pretrained models and datasets without any extra training. CSRA is both easy to implement and light in computations, which also enjoys intuitive explanations and visualizations. + +
+ +
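As the abstract suggests, the module itself is small: for every class it adds a class-specific, spatially attended score (scaled by a factor `lam`) to the usual class-agnostic average-pooled score. Below is a simplified single-head sketch assuming the max-pooling variant of the attention score (temperature T → ∞); `SimpleCSRA` is a hypothetical name for illustration, while the head actually used by the config in this folder is mmpretrain's `CSRAClsHead`, which supports multiple heads and temperatures.

```python
import torch
import torch.nn as nn


class SimpleCSRA(nn.Module):
    """Single-head class-specific residual attention (T -> inf variant).

    A sketch of the idea only; see mmpretrain's `CSRAClsHead` for the real head.
    """

    def __init__(self, in_channels, num_classes, lam=0.1):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1, bias=False)
        self.lam = lam

    def forward(self, feat):
        # feat: (B, C, H, W) backbone feature map
        score = self.classifier(feat)                   # (B, num_classes, H, W)
        base_logit = score.flatten(2).mean(dim=2)       # class-agnostic average pooling
        att_logit = score.flatten(2).max(dim=2).values  # class-specific spatial attention (T -> inf)
        return base_logit + self.lam * att_logit


head = SimpleCSRA(in_channels=2048, num_classes=20, lam=0.1)
logits = head(torch.rand(2, 2048, 14, 14))
print(logits.shape)  # torch.Size([2, 20])
```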
+ +## How to use it? + + + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('resnet101-csra_1xb16_voc07-448px', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/csra/resnet101-csra_1xb16_voc07-448px.py +``` + +Test: + +```shell +python tools/test.py configs/csra/resnet101-csra_1xb16_voc07-448px.py https://download.openmmlab.com/mmclassification/v0/csra/resnet101-csra_1xb16_voc07-448px_20220722-29efb40a.pth +``` + + + +## Models and results + +### Multi-Label Classification on PASCAL VOC 2007 + +| Model | Pretrain | Params (M) | Flops (G) | CF1 | OF1 | mAP | Config | Download | +| :--------------------------------- | :----------: | :--------: | :-------: | :---: | :---: | :---: | :-------------------------------------------: | :-------------------------------------------------------------------------: | +| `resnet101-csra_1xb16_voc07-448px` | From scratch | 23.55 | 4.12 | 89.16 | 90.80 | 94.98 | [config](resnet101-csra_1xb16_voc07-448px.py) | [model](https://download.openmmlab.com/mmclassification/v0/csra/resnet101-csra_1xb16_voc07-448px_20220722-29efb40a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/csra/resnet101-csra_1xb16_voc07-448px_20220722-29efb40a.json) | + +## Citation + +```bibtex +@misc{https://doi.org/10.48550/arxiv.2108.02456, + doi = {10.48550/ARXIV.2108.02456}, + url = {https://arxiv.org/abs/2108.02456}, + author = {Zhu, Ke and Wu, Jianxin}, + keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences}, + title = {Residual Attention: A Simple but Effective Method for Multi-Label Recognition}, + publisher = {arXiv}, + year = {2021}, + copyright = {arXiv.org perpetual, non-exclusive license} +} +``` diff --git a/configs/csra/metafile.yml b/configs/csra/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..112f50c9d44e1bc12359653f89920b93eae67361 --- /dev/null +++ b/configs/csra/metafile.yml @@ -0,0 +1,29 @@ +Collections: + - Name: CSRA + Metadata: + Training Data: PASCAL VOC 2007 + Architecture: + - Class-specific Residual Attention + Paper: + URL: https://arxiv.org/abs/2108.02456 + Title: 'Residual Attention: A Simple but Effective Method for Multi-Label Recognition' + README: configs/csra/README.md + Code: + Version: v0.24.0 + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.24.0/mmcls/models/heads/multi_label_csra_head.py + +Models: + - Name: resnet101-csra_1xb16_voc07-448px + Metadata: + FLOPs: 4120000000 + Parameters: 23550000 + In Collection: CSRA + Results: + - Dataset: PASCAL VOC 2007 + Metrics: + mAP: 94.98 + OF1: 90.80 + CF1: 89.16 + Task: Multi-Label Classification + Weights: https://download.openmmlab.com/mmclassification/v0/csra/resnet101-csra_1xb16_voc07-448px_20220722-29efb40a.pth + Config: configs/csra/resnet101-csra_1xb16_voc07-448px.py diff --git a/configs/csra/resnet101-csra_1xb16_voc07-448px.py b/configs/csra/resnet101-csra_1xb16_voc07-448px.py new file mode 100644 index 0000000000000000000000000000000000000000..85135ae215c072accb4038b1a3fb4b3b796a6072 --- /dev/null +++ b/configs/csra/resnet101-csra_1xb16_voc07-448px.py @@ -0,0 
+1,78 @@ +_base_ = ['../_base_/datasets/voc_bs16.py', '../_base_/default_runtime.py'] + +# Pre-trained Checkpoint Path +checkpoint = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_8xb32_in1k_20210831-539c63f8.pth' # noqa +# To use the pre-trained weight of ResNet101-CutMix from the +# original repo (https://github.com/Kevinz-code/CSRA), the script +# 'tools/model_converters/torchvision_to_mmpretrain.py' can help you convert +# the weight into mmpretrain format. Using that weight, the mAP reaches 95.5. +# checkpoint = 'PATH/TO/PRE-TRAINED_WEIGHT' + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ResNet', + depth=101, + num_stages=4, + out_indices=(3, ), + style='pytorch', + init_cfg=dict( + type='Pretrained', checkpoint=checkpoint, prefix='backbone')), + neck=None, + head=dict( + type='CSRAClsHead', + num_classes=20, + in_channels=2048, + num_heads=1, + lam=0.1, + loss=dict(type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0))) + +# dataset setting +data_preprocessor = dict( + # RGB format normalization parameters + mean=[0, 0, 0], + std=[255, 255, 255]) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=448, crop_ratio_range=(0.7, 1.0)), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=448), + dict( + type='PackInputs', + # `gt_label_difficult` is needed for VOC evaluation + meta_keys=('sample_idx', 'img_path', 'ori_shape', 'img_shape', + 'scale_factor', 'flip', 'flip_direction', + 'gt_label_difficult')), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# optimizer +# the lr of classifier.head is 10 * base_lr, which helps convergence. +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.0002, momentum=0.9, weight_decay=0.0001), + paramwise_cfg=dict(custom_keys={'head': dict(lr_mult=10)})) + +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-7, + by_epoch=True, + begin=0, + end=1, + convert_to_iter_based=True), + dict(type='StepLR', by_epoch=True, step_size=6, gamma=0.1) +] + +train_cfg = dict(by_epoch=True, max_epochs=20, val_interval=1) +val_cfg = dict() +test_cfg = dict() diff --git a/configs/davit/README.md b/configs/davit/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1be19d98e37d4bf75dcc3d89ce689d09512b0505 --- /dev/null +++ b/configs/davit/README.md @@ -0,0 +1,77 @@ +# DaViT + +> [DaViT: Dual Attention Vision Transformers](https://arxiv.org/abs/2204.03645v1) + + + +## Abstract + +In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model.
We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. + +
+ +
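To make the "channel tokens" idea above concrete, the snippet below is a rough, self-contained sketch of channel-wise self-attention, in which channels act as tokens and spatial positions act as the feature dimension. The class name and the scaling choice are illustrative assumptions; the actual MMPretrain DaViT backbone combines this with windowed spatial attention and channel grouping, which the sketch omits.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of channel-token self-attention (illustrative, not the repo code)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, C) with N = H*W spatial positions
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)  # each: (B, heads, C//heads, N)
        # Attention is computed *between channels*; the dot product runs over
        # all N spatial positions, so every channel token sees the whole image.
        attn = (q @ k.transpose(-2, -1)) * (N ** -0.5)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)

x = torch.rand(1, 14 * 14, 96)
print(ChannelAttention(96)(x).shape)  # torch.Size([1, 196, 96])
```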
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('davit-tiny_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('davit-tiny_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/davit/davit-tiny_4xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/davit/davit-tiny_3rdparty_in1k_20221116-700fdf7d.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :------------------------------------------------------------------------------------: | +| `davit-tiny_3rdparty_in1k`\* | From scratch | 28.36 | 4.54 | 82.24 | 96.13 | [config](davit-tiny_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/davit/davit-tiny_3rdparty_in1k_20221116-700fdf7d.pth) | +| `davit-small_3rdparty_in1k`\* | From scratch | 49.75 | 8.80 | 83.61 | 96.75 | [config](davit-small_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/davit/davit-small_3rdparty_in1k_20221116-51a849a6.pth) | +| `davit-base_3rdparty_in1k`\* | From scratch | 87.95 | 15.51 | 84.09 | 96.82 | [config](davit-base_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/davit/davit-base_3rdparty_in1k_20221116-19e0d956.pth) | + +*Models with * are converted from the [official repo](https://github.com/dingmyu/davit/blob/main/mmdet/mmdet/models/backbones/davit.py#L355). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@inproceedings{ding2022davit, + title={DaViT: Dual Attention Vision Transformer}, + author={Ding, Mingyu and Xiao, Bin and Codella, Noel and Luo, Ping and Wang, Jingdong and Yuan, Lu}, + booktitle={ECCV}, + year={2022}, +} +``` diff --git a/configs/davit/davit-base_4xb256_in1k.py b/configs/davit/davit-base_4xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..071702fa7b69a3d893d9999ecf9ace28afbe193d --- /dev/null +++ b/configs/davit/davit-base_4xb256_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/davit/davit-base.py', + '../_base_/datasets/imagenet_bs256_davit_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# data settings +train_dataloader = dict(batch_size=256) diff --git a/configs/davit/davit-small_4xb256_in1k.py b/configs/davit/davit-small_4xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e341031016c53b57adb477093f89b4524c6db4c1 --- /dev/null +++ b/configs/davit/davit-small_4xb256_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/davit/davit-small.py', + '../_base_/datasets/imagenet_bs256_davit_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# data settings +train_dataloader = dict(batch_size=256) diff --git a/configs/davit/davit-tiny_4xb256_in1k.py b/configs/davit/davit-tiny_4xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a16d87f4630b73fd4d76b52bbe926cb75dbb1d03 --- /dev/null +++ b/configs/davit/davit-tiny_4xb256_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/davit/davit-tiny.py', + '../_base_/datasets/imagenet_bs256_davit_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# data settings +train_dataloader = dict(batch_size=256) diff --git a/configs/davit/metafile.yml b/configs/davit/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..588c18fd6dade71ff114a724a42a68a1a38b72bc --- /dev/null +++ b/configs/davit/metafile.yml @@ -0,0 +1,71 @@ +Collections: + - Name: DaViT + Metadata: + Architecture: + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + Paper: + URL: https://arxiv.org/abs/2204.03645v1 + Title: 'DaViT: Dual Attention Vision Transformers' + README: configs/davit/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc3/mmcls/models/backbones/davit.py + Version: v1.0.0rc3 + +Models: + - Name: davit-tiny_3rdparty_in1k + In Collection: DaViT + Metadata: + FLOPs: 4539698688 + Parameters: 28360168 + Training Data: + - ImageNet-1k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 82.24 + Top 5 Accuracy: 96.13 + Weights: https://download.openmmlab.com/mmclassification/v0/davit/davit-tiny_3rdparty_in1k_20221116-700fdf7d.pth + Converted From: + Weights: https://drive.google.com/file/d/1RSpi3lxKaloOL5-or20HuG975tbPwxRZ/view?usp=sharing + Code: https://github.com/dingmyu/davit/blob/main/mmdet/mmdet/models/backbones/davit.py#L355 + Config: configs/davit/davit-tiny_4xb256_in1k.py + - Name: davit-small_3rdparty_in1k + In Collection: DaViT + Metadata: + FLOPs: 8799942144 + Parameters: 49745896 + Training Data: + - ImageNet-1k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 83.61 + Top 5 Accuracy: 96.75 + Weights: 
https://download.openmmlab.com/mmclassification/v0/davit/davit-small_3rdparty_in1k_20221116-51a849a6.pth + Converted From: + Weights: https://drive.google.com/file/d/1q976ruj45mt0RhO9oxhOo6EP_cmj4ahQ/view?usp=sharing + Code: https://github.com/dingmyu/davit/blob/main/mmdet/mmdet/models/backbones/davit.py#L355 + Config: configs/davit/davit-small_4xb256_in1k.py + - Name: davit-base_3rdparty_in1k + In Collection: DaViT + Metadata: + FLOPs: 15509702656 + Parameters: 87954408 + Training Data: + - ImageNet-1k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 84.09 + Top 5 Accuracy: 96.82 + Weights: https://download.openmmlab.com/mmclassification/v0/davit/davit-base_3rdparty_in1k_20221116-19e0d956.pth + Converted From: + Weights: https://drive.google.com/file/d/1u9sDBEueB-YFuLigvcwf4b2YyA4MIVsZ/view?usp=sharing + Code: https://github.com/dingmyu/davit/blob/main/mmdet/mmdet/models/backbones/davit.py#L355 + Config: configs/davit/davit-base_4xb256_in1k.py diff --git a/configs/deit/README.md b/configs/deit/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ee434140a4316fed147c171ea425b6deff2aead6 --- /dev/null +++ b/configs/deit/README.md @@ -0,0 +1,97 @@ +# DeiT + +> [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) + + + +## Abstract + +Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models. + +
+ +
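As a pointer to what "distillation through attention" means in practice, here is a minimal sketch of the hard-label distillation objective described in the paper: the class token is supervised by the ground truth while the distillation token is supervised by the teacher's hard predictions. This is only an illustration of the idea; as the warning below notes, MMPretrain itself does not support training the distilled DeiT variants.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, target):
    """Illustrative DeiT-style hard distillation (a sketch, not repo code)."""
    teacher_labels = teacher_logits.argmax(dim=1)  # teacher's hard predictions
    return 0.5 * F.cross_entropy(cls_logits, target) + \
           0.5 * F.cross_entropy(dist_logits, teacher_labels)

cls_logits, dist_logits = torch.randn(4, 1000), torch.randn(4, 1000)
teacher_logits, target = torch.randn(4, 1000), torch.randint(0, 1000, (4,))
print(hard_distillation_loss(cls_logits, dist_logits, teacher_logits, target))
```

At inference time the predictions of the two token heads are typically averaged to produce the final output.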
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('deit-tiny_4xb256_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('deit-tiny_4xb256_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/deit/deit-tiny_4xb256_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/deit/deit-tiny_4xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------------: | :--------------------------------------------------: | +| `deit-tiny_4xb256_in1k` | From scratch | 5.72 | 1.26 | 74.50 | 92.24 | [config](deit-tiny_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.json) | +| `deit-tiny-distilled_3rdparty_in1k`\* | From scratch | 5.91 | 1.27 | 74.51 | 91.90 | [config](deit-tiny-distilled_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny-distilled_3rdparty_pt-4xb256_in1k_20211216-c429839a.pth) | +| `deit-small_4xb256_in1k` | From scratch | 22.05 | 4.61 | 80.69 | 95.06 | [config](deit-small_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-small_pt-4xb256_in1k_20220218-9425b9bb.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/deit/deit-small_pt-4xb256_in1k_20220218-9425b9bb.json) | +| `deit-small-distilled_3rdparty_in1k`\* | From scratch | 22.44 | 4.63 | 81.17 | 95.40 | [config](deit-small-distilled_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-small-distilled_3rdparty_pt-4xb256_in1k_20211216-4de1d725.pth) | +| `deit-base_16xb64_in1k` | From scratch | 86.57 | 17.58 | 81.76 | 95.81 | [config](deit-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base_pt-16xb64_in1k_20220216-db63c16c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/deit/deit-base_pt-16xb64_in1k_20220216-db63c16c.json) | +| `deit-base_3rdparty_in1k`\* | From scratch | 86.57 | 17.58 | 81.79 | 95.59 | [config](deit-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base_3rdparty_pt-16xb64_in1k_20211124-6f40c188.pth) | +| `deit-base-distilled_3rdparty_in1k`\* | From scratch | 87.34 | 17.67 | 83.33 | 96.49 | [config](deit-base-distilled_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base-distilled_3rdparty_pt-16xb64_in1k_20211216-42891296.pth) | +| `deit-base_224px-pre_3rdparty_in1k-384px`\* | 224px | 86.86 | 55.54 | 83.04 | 96.31 | 
[config](deit-base_16xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base_3rdparty_ft-16xb32_in1k-384px_20211124-822d02f2.pth) | +| `deit-base-distilled_224px-pre_3rdparty_in1k-384px`\* | 224px | 87.63 | 55.65 | 85.55 | 97.35 | [config](deit-base-distilled_16xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base-distilled_3rdparty_ft-16xb32_in1k-384px_20211216-e48d6000.pth) | + +*Models with * are converted from the [official repo](https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L168). The config files of these models are only for inference. We haven't reproduce the training results.* + +```{warning} +MMPretrain doesn't support training the distilled version DeiT. +And we provide distilled version checkpoints for inference only. +``` + +## Citation + +```bibtex +@InProceedings{pmlr-v139-touvron21a, + title = {Training data-efficient image transformers & distillation through attention}, + author = {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve}, + booktitle = {International Conference on Machine Learning}, + pages = {10347--10357}, + year = {2021}, + volume = {139}, + month = {July} +} +``` diff --git a/configs/deit/deit-base-distilled_16xb32_in1k-384px.py b/configs/deit/deit-base-distilled_16xb32_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..60d3112fd530917d2196a24c25d8d0223731c52d --- /dev/null +++ b/configs/deit/deit-base-distilled_16xb32_in1k-384px.py @@ -0,0 +1,37 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='DistilledVisionTransformer', + arch='deit-base', + img_size=384, + patch_size=16, + ), + neck=None, + head=dict( + type='DeiTClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + # Change to the path of the pretrained model + # init_cfg=dict(type='Pretrained', checkpoint=''), +) + +# dataset settings +train_dataloader = dict(batch_size=32) + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (16 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=512) diff --git a/configs/deit/deit-base-distilled_16xb64_in1k.py b/configs/deit/deit-base-distilled_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..207bf250f62f3317df6535cf9b7e8dd0b4a1f5ac --- /dev/null +++ b/configs/deit/deit-base-distilled_16xb64_in1k.py @@ -0,0 +1,46 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='DistilledVisionTransformer', + arch='deit-base', + img_size=224, + patch_size=16), + neck=None, + head=dict( + type='DeiTClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=64) + +# schedule settings +optim_wrapper = dict( + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + }), + clip_grad=dict(max_norm=5.0), +) diff --git a/configs/deit/deit-base_16xb32_in1k-384px.py b/configs/deit/deit-base_16xb32_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..762b4604348d1e8f0940f0243c9c824215d4b207 --- /dev/null +++ b/configs/deit/deit-base_16xb32_in1k-384px.py @@ -0,0 +1,37 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='deit-base', + img_size=384, + patch_size=16, + ), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + # Change to the path of the pretrained model + # init_cfg=dict(type='Pretrained', checkpoint=''), +) + +# dataset settings +train_dataloader = dict(batch_size=32) + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (16 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=512) diff --git a/configs/deit/deit-base_16xb64_in1k.py b/configs/deit/deit-base_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..66f03a99f20a10649a954c15b2aa9c44374704fe --- /dev/null +++ b/configs/deit/deit-base_16xb64_in1k.py @@ -0,0 +1,50 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='deit-base', + img_size=224, + patch_size=16, + drop_path_rate=0.1), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=64) + +# schedule settings +optim_wrapper = dict( + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + }), + clip_grad=dict(max_norm=5.0), +) + +# runtime settings +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] diff --git a/configs/deit/deit-small-distilled_4xb256_in1k.py b/configs/deit/deit-small-distilled_4xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9c7c58cb3d76e8b36f766080e4ec7de056a0621b --- /dev/null +++ b/configs/deit/deit-small-distilled_4xb256_in1k.py @@ -0,0 +1,46 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='DistilledVisionTransformer', + arch='deit-small', + img_size=224, + patch_size=16), + neck=None, + head=dict( + type='DeiTClsHead', + num_classes=1000, + in_channels=384, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) + +# data settings +train_dataloader = dict(batch_size=256) + +# schedule settings +optim_wrapper = dict( + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + }), + clip_grad=dict(max_norm=5.0), +) diff --git a/configs/deit/deit-small_4xb256_in1k.py b/configs/deit/deit-small_4xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b96d84ec46bf2badd08b69fddaa2d8b8109b1ebf --- /dev/null +++ b/configs/deit/deit-small_4xb256_in1k.py @@ -0,0 +1,48 @@ +# In small and tiny arch, remove drop path and EMA hook comparing with the +# original config +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='deit-small', + img_size=224, + 
patch_size=16), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=384, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) + +# data settings +train_dataloader = dict(batch_size=256) + +# schedule settings +optim_wrapper = dict( + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + }), + clip_grad=dict(max_norm=5.0), +) diff --git a/configs/deit/deit-tiny-distilled_4xb256_in1k.py b/configs/deit/deit-tiny-distilled_4xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..00a9c4bd214a7c3d3eb1163b73aeb70251ce1bbc --- /dev/null +++ b/configs/deit/deit-tiny-distilled_4xb256_in1k.py @@ -0,0 +1,47 @@ +# The distillation config is only for evaluation. +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='DistilledVisionTransformer', + arch='deit-tiny', + img_size=224, + patch_size=16), + neck=None, + head=dict( + type='DeiTClsHead', + num_classes=1000, + in_channels=192, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) + +# data settings +train_dataloader = dict(batch_size=256) + +# schedule settings +optim_wrapper = dict( + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + }), + clip_grad=dict(max_norm=5.0), +) diff --git a/configs/deit/deit-tiny_4xb256_in1k.py b/configs/deit/deit-tiny_4xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..486669e9c16e01ccc3d469c55bb04e714225b624 --- /dev/null +++ b/configs/deit/deit-tiny_4xb256_in1k.py @@ -0,0 +1,48 @@ +# In small and tiny arch, remove drop path and EMA hook comparing with the +# original config +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='deit-tiny', + img_size=224, + patch_size=16), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=192, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ]), +) + +# data settings +train_dataloader = dict(batch_size=256) + +# schedule settings +optim_wrapper = dict( + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + }), + clip_grad=dict(max_norm=5.0), +) diff --git 
a/configs/deit/metafile.yml b/configs/deit/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..f6f0c5e56a4f72fc7df812705b9d2ec4a6a589bb --- /dev/null +++ b/configs/deit/metafile.yml @@ -0,0 +1,153 @@ +Collections: + - Name: DeiT + Metadata: + Training Data: ImageNet-1k + Architecture: + - Layer Normalization + - Scaled Dot-Product Attention + - Attention Dropout + - Multi-Head Attention + Paper: + Title: Training data-efficient image transformers & distillation through attention + URL: https://arxiv.org/abs/2012.12877 + README: configs/deit/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.19.0/mmcls/models/backbones/deit.py + Version: v0.19.0 + +Models: + - Name: deit-tiny_4xb256_in1k + Metadata: + FLOPs: 1258219200 + Parameters: 5717416 + In Collection: DeiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 74.5 + Top 5 Accuracy: 92.24 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.pth + Config: configs/deit/deit-tiny_4xb256_in1k.py + - Name: deit-tiny-distilled_3rdparty_in1k + Metadata: + FLOPs: 1265371776 + Parameters: 5910800 + In Collection: DeiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 74.51 + Top 5 Accuracy: 91.9 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny-distilled_3rdparty_pt-4xb256_in1k_20211216-c429839a.pth + Config: configs/deit/deit-tiny-distilled_4xb256_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_tiny_distilled_patch16_224-b40b3cf7.pth + Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L108 + - Name: deit-small_4xb256_in1k + Metadata: + FLOPs: 4607954304 + Parameters: 22050664 + In Collection: DeiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.69 + Top 5 Accuracy: 95.06 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-small_pt-4xb256_in1k_20220218-9425b9bb.pth + Config: configs/deit/deit-small_4xb256_in1k.py + - Name: deit-small-distilled_3rdparty_in1k + Metadata: + FLOPs: 4632876288 + Parameters: 22436432 + In Collection: DeiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.17 + Top 5 Accuracy: 95.4 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-small-distilled_3rdparty_pt-4xb256_in1k_20211216-4de1d725.pth + Config: configs/deit/deit-small-distilled_4xb256_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_small_distilled_patch16_224-649709d9.pth + Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L123 + - Name: deit-base_16xb64_in1k + Metadata: + FLOPs: 17581972224 + Parameters: 86567656 + In Collection: DeiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.76 + Top 5 Accuracy: 95.81 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base_pt-16xb64_in1k_20220216-db63c16c.pth + Config: configs/deit/deit-base_16xb64_in1k.py + - Name: deit-base_3rdparty_in1k + Metadata: + FLOPs: 17581972224 + Parameters: 86567656 + In Collection: DeiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.79 + Top 5 Accuracy: 95.59 + Task: Image Classification + Weights:
https://download.openmmlab.com/mmclassification/v0/deit/deit-base_3rdparty_pt-16xb64_in1k_20211124-6f40c188.pth + Config: configs/deit/deit-base_16xb64_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth + Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L93 + - Name: deit-base-distilled_3rdparty_in1k + Metadata: + FLOPs: 17674283520 + Parameters: 87338192 + In Collection: DeiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.33 + Top 5 Accuracy: 96.49 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base-distilled_3rdparty_pt-16xb64_in1k_20211216-42891296.pth + Config: configs/deit/deit-base-distilled_16xb64_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_base_distilled_patch16_224-df68dfff.pth + Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L138 + - Name: deit-base_224px-pre_3rdparty_in1k-384px + Metadata: + FLOPs: 55538974464 + Parameters: 86859496 + In Collection: DeiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.04 + Top 5 Accuracy: 96.31 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base_3rdparty_ft-16xb32_in1k-384px_20211124-822d02f2.pth + Config: configs/deit/deit-base_16xb32_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_base_patch16_384-8de9b5d1.pth + Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L153 + - Name: deit-base-distilled_224px-pre_3rdparty_in1k-384px + Metadata: + FLOPs: 55645294080 + Parameters: 87630032 + In Collection: DeiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.55 + Top 5 Accuracy: 97.35 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base-distilled_3rdparty_ft-16xb32_in1k-384px_20211216-e48d6000.pth + Config: configs/deit/deit-base-distilled_16xb32_in1k-384px.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_base_distilled_patch16_384-d0272ac0.pth + Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L168 diff --git a/configs/deit3/README.md b/configs/deit3/README.md new file mode 100644 index 0000000000000000000000000000000000000000..18694b7eb9b97589aece3c9bfc7187b9c9d83841 --- /dev/null +++ b/configs/deit3/README.md @@ -0,0 +1,90 @@ +# DeiT III: Revenge of the ViT + +> [DeiT III: Revenge of the ViT](https://arxiv.org/abs/2204.07118) + + + +## Abstract + +A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors either about the input data or of specific tasks. Recent works show that ViTs benefit from self-supervised pre-training, in particular BerT-like pre-training like BeiT. In this paper, we revisit the supervised training of ViTs. Our procedure builds upon and simplifies a recipe introduced for training ResNet-50. It includes a new simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning. 
Our evaluations on Image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning and semantic segmentation show that our procedure outperforms by a large margin previous fully supervised training recipes for ViT. It also reveals that the performance of our ViT trained with supervision is comparable to that of more recent architectures. Our results could serve as better baselines for recent self-supervised approaches demonstrated on ViT. + +
+ +
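The "3 augmentations" mentioned in the abstract refer to the paper's simplified recipe (often called 3-Augment): each image receives one of grayscale, solarization, or Gaussian blur, on top of a simple crop, horizontal flip, and color jitter. The torchvision sketch below is only a rough approximation for intuition; the parameter values are assumptions and do not come from the MMPretrain configs.

```python
from torchvision import transforms

# Rough approximation of the 3-Augment recipe (illustrative parameters only).
three_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomChoice([
        transforms.RandomGrayscale(p=1.0),
        transforms.RandomSolarize(threshold=128, p=1.0),
        transforms.GaussianBlur(kernel_size=23),
    ]),
    transforms.ColorJitter(0.3, 0.3, 0.3),
    transforms.ToTensor(),
])
```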
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('deit3-small-p16_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('deit3-small-p16_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/deit3/deit3-small-p16_64xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k_20221008-0f7c70cf.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------------: | :------------------------------------------------------: | +| `deit3-small-p16_3rdparty_in1k`\* | From scratch | 22.06 | 4.61 | 81.35 | 95.31 | [config](deit3-small-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k_20221008-0f7c70cf.pth) | +| `deit3-small-p16_3rdparty_in1k-384px`\* | From scratch | 22.21 | 15.52 | 83.43 | 96.68 | [config](deit3-small-p16_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k-384px_20221008-a2c1a0c7.pth) | +| `deit3-small-p16_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 22.06 | 4.61 | 83.06 | 96.77 | [config](deit3-small-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_in21k-pre_3rdparty_in1k_20221009-dcd90827.pth) | +| `deit3-small-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 22.21 | 15.52 | 84.84 | 97.48 | [config](deit3-small-p16_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_in21k-pre_3rdparty_in1k-384px_20221009-de116dd7.pth) | +| `deit3-medium-p16_3rdparty_in1k`\* | From scratch | 38.85 | 8.00 | 82.99 | 96.22 | [config](deit3-medium-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-medium-p16_3rdparty_in1k_20221008-3b21284d.pth) | +| `deit3-medium-p16_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 38.85 | 8.00 | 84.56 | 97.19 | [config](deit3-medium-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-medium-p16_in21k-pre_3rdparty_in1k_20221009-472f11e2.pth) | +| `deit3-base-p16_3rdparty_in1k`\* | From scratch | 86.59 | 17.58 | 83.80 | 96.55 | [config](deit3-base-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_3rdparty_in1k_20221008-60b8c8bf.pth) | +| `deit3-base-p16_3rdparty_in1k-384px`\* | From scratch | 86.88 | 55.54 | 85.08 | 97.25 | [config](deit3-base-p16_64xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_3rdparty_in1k-384px_20221009-e19e36d4.pth) | +| `deit3-base-p16_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 86.59 | 17.58 | 85.70 | 97.75 | [config](deit3-base-p16_64xb64_in1k.py) | 
[model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_in21k-pre_3rdparty_in1k_20221009-87983ca1.pth) | +| `deit3-base-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 86.88 | 55.54 | 86.73 | 98.11 | [config](deit3-base-p16_64xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_in21k-pre_3rdparty_in1k-384px_20221009-5e4e37b9.pth) | +| `deit3-large-p16_3rdparty_in1k`\* | From scratch | 304.37 | 61.60 | 84.87 | 97.01 | [config](deit3-large-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_3rdparty_in1k_20221009-03b427ea.pth) | +| `deit3-large-p16_3rdparty_in1k-384px`\* | From scratch | 304.76 | 191.21 | 85.82 | 97.60 | [config](deit3-large-p16_64xb16_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_3rdparty_in1k-384px_20221009-4317ce62.pth) | +| `deit3-large-p16_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 304.37 | 61.60 | 86.97 | 98.24 | [config](deit3-large-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_in21k-pre_3rdparty_in1k_20221009-d8d27084.pth) | +| `deit3-large-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 304.76 | 191.21 | 87.73 | 98.51 | [config](deit3-large-p16_64xb16_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_in21k-pre_3rdparty_in1k-384px_20221009-75fea03f.pth) | +| `deit3-huge-p14_3rdparty_in1k`\* | From scratch | 632.13 | 167.40 | 85.21 | 97.36 | [config](deit3-huge-p14_64xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-huge-p14_3rdparty_in1k_20221009-e107bcb7.pth) | +| `deit3-huge-p14_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 632.13 | 167.40 | 87.19 | 98.26 | [config](deit3-huge-p14_64xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-huge-p14_in21k-pre_3rdparty_in1k_20221009-19b8a535.pth) | + +*Models with * are converted from the [official repo](https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171). The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{Touvron2022DeiTIR, + title={DeiT III: Revenge of the ViT}, + author={Hugo Touvron and Matthieu Cord and Herve Jegou}, + journal={arXiv preprint arXiv:2204.07118}, + year={2022}, +} +``` diff --git a/configs/deit3/deit3-base-p16_64xb32_in1k-384px.py b/configs/deit3/deit3-base-p16_64xb32_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..b6c8a8c411ee96a88bc44c042cdf134a36eb05da --- /dev/null +++ b/configs/deit3/deit3-base-p16_64xb32_in1k-384px.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/deit3/deit3-base-p16-384.py', + '../_base_/datasets/imagenet_bs64_deit3_384.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule settings +optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (64 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/deit3/deit3-base-p16_64xb64_in1k.py b/configs/deit3/deit3-base-p16_64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..c69a64cdd06da1e868bb08e9eec5cbf9b82f5aa9 --- /dev/null +++ b/configs/deit3/deit3-base-p16_64xb64_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/deit3/deit3-base-p16-224.py', + '../_base_/datasets/imagenet_bs64_deit3_224.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# dataset setting +train_dataloader = dict(batch_size=64) + +# schedule settings +optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (64 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/deit3/deit3-huge-p14_64xb32_in1k.py b/configs/deit3/deit3-huge-p14_64xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f8cae075b6a28f8519390983621b2dc98173e507 --- /dev/null +++ b/configs/deit3/deit3-huge-p14_64xb32_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/deit3/deit3-huge-p14-224.py', + '../_base_/datasets/imagenet_bs64_deit3_224.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# dataset setting +train_dataloader = dict(batch_size=32) + +# schedule settings +optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (64 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/deit3/deit3-large-p16_64xb16_in1k-384px.py b/configs/deit3/deit3-large-p16_64xb16_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..84fb0feae636a3f3c4b2297ed6935e817701cbea --- /dev/null +++ b/configs/deit3/deit3-large-p16_64xb16_in1k-384px.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/deit3/deit3-large-p16-384.py', + '../_base_/datasets/imagenet_bs64_deit3_384.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# dataset setting +train_dataloader = dict(batch_size=16) + +# schedule settings +optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (64 GPUs) x (16 samples per GPU) +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/deit3/deit3-large-p16_64xb64_in1k.py b/configs/deit3/deit3-large-p16_64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a67ac21f9ba3fefdb7e22429e565fb6ee6eeff86 --- /dev/null +++ b/configs/deit3/deit3-large-p16_64xb64_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/deit3/deit3-large-p16-224.py', + '../_base_/datasets/imagenet_bs64_deit3_224.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# dataset setting +train_dataloader = dict(batch_size=64) + +# schedule settings +optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/deit3/deit3-medium-p16_64xb64_in1k.py b/configs/deit3/deit3-medium-p16_64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..def48e682a5fa66e166f4419b8e1850e26f75d17 --- /dev/null +++ b/configs/deit3/deit3-medium-p16_64xb64_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/deit3/deit3-medium-p16-224.py', + '../_base_/datasets/imagenet_bs64_deit3_224.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# dataset setting +train_dataloader = dict(batch_size=64) + +# schedule settings +optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (64 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/deit3/deit3-small-p16_64xb64_in1k-384px.py b/configs/deit3/deit3-small-p16_64xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..e6b3e892c34268d2bdfeb9f7ab7f1808ea203558 --- /dev/null +++ b/configs/deit3/deit3-small-p16_64xb64_in1k-384px.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/deit3/deit3-small-p16-384.py', + '../_base_/datasets/imagenet_bs64_deit3_384.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# dataset setting +train_dataloader = dict(batch_size=64) + +# schedule settings +optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (64 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/deit3/deit3-small-p16_64xb64_in1k.py b/configs/deit3/deit3-small-p16_64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..58b0a2f1837e09edc3c43d6776fda169e4b0480b --- /dev/null +++ b/configs/deit3/deit3-small-p16_64xb64_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/deit3/deit3-small-p16-224.py', + '../_base_/datasets/imagenet_bs64_deit3_224.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# dataset setting +train_dataloader = dict(batch_size=64) + +# schedule settings +optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (64 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/deit3/metafile.yml b/configs/deit3/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..6f50fdc396c017fcbf3d2542f6fe52c0ed5bf546 --- /dev/null +++ b/configs/deit3/metafile.yml @@ -0,0 +1,310 @@ +Collections: + - Name: DeiT3 + Metadata: + Architecture: + - Attention Dropout + - Convolution + - Dense Connections + - Dropout + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + - Tanh Activation + Paper: + URL: https://arxiv.org/abs/2204.07118 + Title: 'DeiT III: Revenge of the ViT' + README: configs/deit3/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc2/mmcls/models/backbones/deit3.py + Version: v1.0.0rc2 + +Models: + - Name: deit3-small-p16_3rdparty_in1k + In Collection: DeiT3 + Metadata: + FLOPs: 4607954304 + Parameters: 22059496 + Training Data: + - ImageNet-1k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 81.35 + Top 5 Accuracy: 95.31 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k_20221008-0f7c70cf.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_small_224_1k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-small-p16_64xb64_in1k.py + - Name: deit3-small-p16_3rdparty_in1k-384px + In Collection: DeiT3 + Metadata: + FLOPs: 15517663104 + Parameters: 22205416 + Training Data: + - ImageNet-1k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 83.43 + Top 5 Accuracy: 96.68 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k-384px_20221008-a2c1a0c7.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_small_384_1k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-small-p16_64xb64_in1k-384px.py + - Name: deit3-small-p16_in21k-pre_3rdparty_in1k + In Collection: DeiT3 + Metadata: + FLOPs: 4607954304 + Parameters: 22059496 + Training Data: + - ImageNet-21k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 83.06 + Top 5 Accuracy: 96.77 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_in21k-pre_3rdparty_in1k_20221009-dcd90827.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_small_224_21k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-small-p16_64xb64_in1k.py + - Name: deit3-small-p16_in21k-pre_3rdparty_in1k-384px + In Collection: DeiT3 + Metadata: + FLOPs: 15517663104 + Parameters: 22205416 + Training Data: + - ImageNet-21k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 84.84 + Top 5 Accuracy: 97.48 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_in21k-pre_3rdparty_in1k-384px_20221009-de116dd7.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_small_384_21k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-small-p16_64xb64_in1k-384px.py + - Name: deit3-medium-p16_3rdparty_in1k + In Collection: DeiT3 + Metadata: + FLOPs: 8003064320 + Parameters: 38849512 + Training Data: + - ImageNet-1k + Results: + - 
Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 82.99 + Top 5 Accuracy: 96.22 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-medium-p16_3rdparty_in1k_20221008-3b21284d.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_medium_224_1k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-medium-p16_64xb64_in1k.py + - Name: deit3-medium-p16_in21k-pre_3rdparty_in1k + In Collection: DeiT3 + Metadata: + FLOPs: 8003064320 + Parameters: 38849512 + Training Data: + - ImageNet-21k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 84.56 + Top 5 Accuracy: 97.19 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-medium-p16_in21k-pre_3rdparty_in1k_20221009-472f11e2.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_medium_224_21k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-medium-p16_64xb64_in1k.py + - Name: deit3-base-p16_3rdparty_in1k + In Collection: DeiT3 + Metadata: + FLOPs: 17581972224 + Parameters: 86585320 + Training Data: + - ImageNet-1k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 83.80 + Top 5 Accuracy: 96.55 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_3rdparty_in1k_20221008-60b8c8bf.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_base_224_1k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-base-p16_64xb64_in1k.py + - Name: deit3-base-p16_3rdparty_in1k-384px + In Collection: DeiT3 + Metadata: + FLOPs: 55538974464 + Parameters: 86877160 + Training Data: + - ImageNet-1k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 85.08 + Top 5 Accuracy: 97.25 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_3rdparty_in1k-384px_20221009-e19e36d4.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_base_384_1k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-base-p16_64xb32_in1k-384px.py + - Name: deit3-base-p16_in21k-pre_3rdparty_in1k + In Collection: DeiT3 + Metadata: + FLOPs: 17581972224 + Parameters: 86585320 + Training Data: + - ImageNet-21k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 85.70 + Top 5 Accuracy: 97.75 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_in21k-pre_3rdparty_in1k_20221009-87983ca1.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_base_224_21k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-base-p16_64xb64_in1k.py + - Name: deit3-base-p16_in21k-pre_3rdparty_in1k-384px + In Collection: DeiT3 + Metadata: + FLOPs: 55538974464 + Parameters: 86877160 + Training Data: + - ImageNet-21k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 86.73 + Top 5 Accuracy: 98.11 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_in21k-pre_3rdparty_in1k-384px_20221009-5e4e37b9.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_base_384_21k.pth + Code: 
https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-base-p16_64xb32_in1k-384px.py + - Name: deit3-large-p16_3rdparty_in1k + In Collection: DeiT3 + Metadata: + FLOPs: 61603111936 + Parameters: 304374760 + Training Data: + - ImageNet-1k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 84.87 + Top 5 Accuracy: 97.01 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_3rdparty_in1k_20221009-03b427ea.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_large_224_1k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-large-p16_64xb64_in1k.py + - Name: deit3-large-p16_3rdparty_in1k-384px + In Collection: DeiT3 + Metadata: + FLOPs: 191210034176 + Parameters: 304763880 + Training Data: + - ImageNet-1k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 85.82 + Top 5 Accuracy: 97.60 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_3rdparty_in1k-384px_20221009-4317ce62.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_large_384_1k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-large-p16_64xb16_in1k-384px.py + - Name: deit3-large-p16_in21k-pre_3rdparty_in1k + In Collection: DeiT3 + Metadata: + FLOPs: 61603111936 + Parameters: 304374760 + Training Data: + - ImageNet-21k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 86.97 + Top 5 Accuracy: 98.24 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_in21k-pre_3rdparty_in1k_20221009-d8d27084.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_large_224_21k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-large-p16_64xb64_in1k.py + - Name: deit3-large-p16_in21k-pre_3rdparty_in1k-384px + In Collection: DeiT3 + Metadata: + FLOPs: 191210034176 + Parameters: 304763880 + Training Data: + - ImageNet-21k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 87.73 + Top 5 Accuracy: 98.51 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_in21k-pre_3rdparty_in1k-384px_20221009-75fea03f.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_large_384_21k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-large-p16_64xb16_in1k-384px.py + - Name: deit3-huge-p14_3rdparty_in1k + In Collection: DeiT3 + Metadata: + FLOPs: 167400741120 + Parameters: 632126440 + Training Data: + - ImageNet-1k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 85.21 + Top 5 Accuracy: 97.36 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-huge-p14_3rdparty_in1k_20221009-e107bcb7.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_huge_224_1k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-huge-p14_64xb32_in1k.py + - Name: deit3-huge-p14_in21k-pre_3rdparty_in1k + In Collection: DeiT3 + Metadata: + FLOPs: 167400741120 + Parameters: 632126440 + Training Data: + - ImageNet-21k + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 
Accuracy: 87.19 + Top 5 Accuracy: 98.26 + Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-huge-p14_in21k-pre_3rdparty_in1k_20221009-19b8a535.pth + Converted From: + Weights: https://dl.fbaipublicfiles.com/deit/deit_3_huge_224_1k.pth + Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171 + Config: configs/deit3/deit3-huge-p14_64xb32_in1k.py diff --git a/configs/densecl/README.md b/configs/densecl/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d1e1295d9f6a12d47196e6d2c4663d0758076167 --- /dev/null +++ b/configs/densecl/README.md @@ -0,0 +1,85 @@ +# DenseCL + +> [Dense contrastive learning for self-supervised visual pre-training](https://arxiv.org/abs/2011.09157) + + + +## Abstract + +To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. + +
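The pixel-level objective described above pairs each local feature in one view with its most similar local feature in the other view. The snippet below is only a toy sketch of that matching step with random tensors standing in for backbone features; it is not the DenseCL implementation shipped in this folder.

```python
import torch
import torch.nn.functional as F

# Toy illustration of dense (pixel-level) matching between two augmented views:
# every grid feature in view 1 is paired with its most similar grid feature in
# view 2 via cosine similarity. Random tensors stand in for backbone features.
b, c, hw = 2, 128, 49                            # batch, channels, 7x7 grid
f1 = F.normalize(torch.rand(b, c, hw), dim=1)    # view 1 features
f2 = F.normalize(torch.rand(b, c, hw), dim=1)    # view 2 features
sim = torch.einsum('bci,bcj->bij', f1, f2)       # (B, HW, HW) similarity map
pos_idx = sim.argmax(dim=-1)                     # positive location per feature
print(sim.shape, pos_idx.shape)                  # [2, 49, 49] and [2, 49]
```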
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('densecl_resnet50_8xb32-coslr-200e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :--------------------------------------- | :--------: | :-------: | :-------------------------------------------------: | :----------------------------------------------------------------------------------------: | +| `densecl_resnet50_8xb32-coslr-200e_in1k` | 64.85 | 4.11 | [config](densecl_resnet50_8xb32-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k` | [DENSECL](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth) | 25.56 | 4.11 | 63.50 | [config](benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.json) | + +## Citation + +```bibtex +@inproceedings{wang2021dense, + title={Dense contrastive learning for self-supervised visual pre-training}, + author={Wang, Xinlong and Zhang, Rufeng and Shen, Chunhua and Kong, Tao and Li, Lei}, + booktitle={CVPR}, + year={2021} +} +``` diff --git a/configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py b/configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..37795d9c866c5f9b26b0e016959a01677b8a216e --- 
/dev/null +++ b/configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py @@ -0,0 +1,20 @@ +_base_ = [ + '../../_base_/models/resnet50.py', + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_sgd_steplr_100e.py', + '../../_base_/default_runtime.py', +] + +model = dict( + backbone=dict( + frozen_stages=4, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.'))) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='SGD', lr=30., momentum=0.9, weight_decay=0.)) + +# runtime settings +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) diff --git a/configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py b/configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..8a3959f1a91c1911e426563759795afeef71bea0 --- /dev/null +++ b/configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_mocov2.py', + '../_base_/schedules/imagenet_sgd_coslr_200e.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='DenseCL', + queue_len=65536, + feat_dim=128, + momentum=0.001, + loss_lambda=0.5, + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='BN'), + zero_init_residual=False), + neck=dict( + type='DenseCLNeck', + in_channels=2048, + hid_channels=2048, + out_channels=128, + num_grid=None), + head=dict( + type='ContrastiveHead', + loss=dict(type='CrossEntropyLoss'), + temperature=0.2), +) +find_unused_parameters = True + +# runtime settings +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=256) diff --git a/configs/densecl/metafile.yml b/configs/densecl/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..24449910aaa5930cbd32ec8fae18dec62ee73505 --- /dev/null +++ b/configs/densecl/metafile.yml @@ -0,0 +1,44 @@ +Collections: + - Name: DenseCL + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - SGD with Momentum + - Weight Decay + Training Resources: 8x V100 GPUs + Architecture: + - ResNet + Paper: + Title: Dense contrastive learning for self-supervised visual pre-training + URL: https://arxiv.org/abs/2011.09157 + README: configs/densecl/README.md + +Models: + - Name: densecl_resnet50_8xb32-coslr-200e_in1k + Metadata: + Epochs: 200 + Batch Size: 256 + FLOPs: 4109364224 + Parameters: 64850560 + Training Data: ImageNet-1k + In Collection: DenseCL + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth + Config: configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py + Downstream: + - resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k + - Name: resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 256 + FLOPs: 4109464576 + Parameters: 25557032 + Training Data: ImageNet-1k + In Collection: DenseCL + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 63.5 + Weights: https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth + Config: configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py diff --git a/configs/densenet/README.md b/configs/densenet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..fe40fdd99cf069d76b4937e96ae252c5122ba953 --- /dev/null +++ b/configs/densenet/README.md @@ -0,0 +1,82 @@ +# DenseNet + +> [Densely Connected Convolutional Networks](https://arxiv.org/abs/1608.06993) + + + +## Abstract + +Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. + +
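The dense connectivity claim above (L(L+1)/2 connections, with every layer consuming the concatenation of all preceding feature maps) fits in a few lines of PyTorch. The block below is a simplified illustration with made-up channel sizes, not the DenseNet backbone used by the configs in this folder.

```python
import torch
import torch.nn as nn

class ToyDenseBlock(nn.Module):
    """Minimal sketch of dense connectivity: each layer sees all earlier maps."""

    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            # Concatenate every preceding feature map before the convolution.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

block = ToyDenseBlock(in_channels=16, growth_rate=8, num_layers=4)
print(block(torch.rand(1, 16, 32, 32)).shape)  # torch.Size([1, 48, 32, 32])
```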
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('densenet121_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('densenet121_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/densenet/densenet121_4xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :------------------------------------------------------------------------------------: | +| `densenet121_3rdparty_in1k`\* | From scratch | 7.98 | 2.88 | 74.96 | 92.21 | [config](densenet121_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth) | +| `densenet169_3rdparty_in1k`\* | From scratch | 14.15 | 3.42 | 76.08 | 93.11 | [config](densenet169_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet169_4xb256_in1k_20220426-a2889902.pth) | +| `densenet201_3rdparty_in1k`\* | From scratch | 20.01 | 4.37 | 77.32 | 93.64 | [config](densenet201_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet201_4xb256_in1k_20220426-05cae4ef.pth) | +| `densenet161_3rdparty_in1k`\* | From scratch | 28.68 | 7.82 | 77.61 | 93.83 | [config](densenet161_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet161_4xb256_in1k_20220426-ee6a80a9.pth) | + +*Models with * are converted from the [official repo](https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@misc{https://doi.org/10.48550/arxiv.1608.06993, + doi = {10.48550/ARXIV.1608.06993}, + url = {https://arxiv.org/abs/1608.06993}, + author = {Huang, Gao and Liu, Zhuang and van der Maaten, Laurens and Weinberger, Kilian Q.}, + keywords = {Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences}, + title = {Densely Connected Convolutional Networks}, + publisher = {arXiv}, + year = {2016}, + copyright = {arXiv.org perpetual, non-exclusive license} +} +``` diff --git a/configs/densenet/densenet121_4xb256_in1k.py b/configs/densenet/densenet121_4xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..dc9854f5b44da27bcf4a5a4d5faefca625dc85b0 --- /dev/null +++ b/configs/densenet/densenet121_4xb256_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/densenet/densenet121.py', + '../_base_/datasets/imagenet_bs64.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_dataloader = dict(batch_size=256) + +# schedule settings +train_cfg = dict(by_epoch=True, max_epochs=90) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (4 GPUs) x (256 samples per GPU) +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/densenet/densenet161_4xb256_in1k.py b/configs/densenet/densenet161_4xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a28a278bfc8132f4099afc576c43b05fd4095fd0 --- /dev/null +++ b/configs/densenet/densenet161_4xb256_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/densenet/densenet161.py', + '../_base_/datasets/imagenet_bs64.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_dataloader = dict(batch_size=256) + +# schedule settings +train_cfg = dict(by_epoch=True, max_epochs=90) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (4 GPUs) x (256 samples per GPU) +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/densenet/densenet169_4xb256_in1k.py b/configs/densenet/densenet169_4xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..73469da115d23da250d790d68a36f55fb8eccfff --- /dev/null +++ b/configs/densenet/densenet169_4xb256_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/densenet/densenet169.py', + '../_base_/datasets/imagenet_bs64.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_dataloader = dict(batch_size=256) + +# schedule settings +train_cfg = dict(by_epoch=True, max_epochs=90) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (4 GPUs) x (256 samples per GPU) +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/densenet/densenet201_4xb256_in1k.py b/configs/densenet/densenet201_4xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4a9b7b1923351fc1f47ad1aa0e4470316e076e96 --- /dev/null +++ b/configs/densenet/densenet201_4xb256_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/densenet/densenet201.py', + '../_base_/datasets/imagenet_bs64.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_dataloader = dict(batch_size=256) + +# schedule settings +train_cfg = dict(by_epoch=True, max_epochs=90) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (4 GPUs) x (256 samples per GPU) +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/densenet/metafile.yml b/configs/densenet/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..40575acb6b4314d8ebc5c9317e9e032e0a8b0cba --- /dev/null +++ b/configs/densenet/metafile.yml @@ -0,0 +1,76 @@ +Collections: + - Name: DenseNet + Metadata: + Training Data: ImageNet-1k + Architecture: + - DenseBlock + Paper: + URL: https://arxiv.org/abs/1608.06993 + Title: Densely Connected Convolutional Networks + README: configs/densenet/README.md + +Models: + - Name: densenet121_3rdparty_in1k + Metadata: + FLOPs: 2881695488 + Parameters: 7978856 + In Collection: DenseNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 74.96 + Top 5 Accuracy: 92.21 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth + Config: configs/densenet/densenet121_4xb256_in1k.py + Converted From: + Weights: https://download.pytorch.org/models/densenet121-a639ec97.pth + Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py + - Name: densenet169_3rdparty_in1k + Metadata: + FLOPs: 3416860160 + Parameters: 14149480 + In Collection: DenseNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 76.08 + Top 5 Accuracy: 93.11 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet169_4xb256_in1k_20220426-a2889902.pth + Config: configs/densenet/densenet169_4xb256_in1k.py + Converted From: + Weights: https://download.pytorch.org/models/densenet169-b2777c0a.pth + Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py + - Name: densenet201_3rdparty_in1k + Metadata: + FLOPs: 4365236736 + Parameters: 20013928 + In Collection: DenseNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.32 + Top 5 Accuracy: 93.64 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet201_4xb256_in1k_20220426-05cae4ef.pth + Config: configs/densenet/densenet201_4xb256_in1k.py + Converted From: + Weights: https://download.pytorch.org/models/densenet201-c1103571.pth + Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py + - Name: densenet161_3rdparty_in1k + Metadata: + FLOPs: 7816363968 + Parameters: 28681000 + In Collection: DenseNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.61 + Top 5 Accuracy: 93.83 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet161_4xb256_in1k_20220426-ee6a80a9.pth + Config: 
configs/densenet/densenet161_4xb256_in1k.py + Converted From: + Weights: https://download.pytorch.org/models/densenet161-8d451a50.pth + Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py diff --git a/configs/dinov2/README.md b/configs/dinov2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..aa79d6b43c677f96236a52630b39ca9a6e381e5d --- /dev/null +++ b/configs/dinov2/README.md @@ -0,0 +1,58 @@ +# DINOv2 + +> [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) + + + +## Abstract + +The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing allpurpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels. + +
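The headless configs in this folder use `img_size=518` with `patch_size=14`, so the backbone operates on a 37x37 patch grid. A quick sanity check of the implied token count (an illustration only, not part of the configs):

```python
# Token count implied by the headless DINOv2 configs (img_size=518, patch_size=14):
# a 37x37 patch grid, i.e. 1369 patch tokens plus one class token per image.
img_size, patch_size = 518, 14
patches_per_side = img_size // patch_size      # 37
num_tokens = patches_per_side ** 2 + 1         # 1370, including the class token
print(patches_per_side, num_tokens)
```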
+ +
+ +## How to use it? + + + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('vit-small-p14_dinov2-pre_3rdparty', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :------------------------------------ | :--------: | :-------: | :--------------------------------------------: | :------------------------------------------------------------------------------------------------: | +| `vit-small-p14_dinov2-pre_3rdparty`\* | 22.06 | 46.76 | [config](vit-small-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth) | +| `vit-base-p14_dinov2-pre_3rdparty`\* | 86.58 | 152.00 | [config](vit-base-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth) | +| `vit-large-p14_dinov2-pre_3rdparty`\* | 304.00 | 507.00 | [config](vit-large-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth) | +| `vit-giant-p14_dinov2-pre_3rdparty`\* | 1136.00 | 1784.00 | [config](vit-giant-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth) | + +*Models with * are converted from the [official repo](https://github.com/facebookresearch/dinov2). The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@misc{oquab2023dinov2, + title={DINOv2: Learning Robust Visual Features without Supervision}, + author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. 
and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr}, + journal={arXiv:2304.07193}, + year={2023} +} +``` diff --git a/configs/dinov2/metafile.yml b/configs/dinov2/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..48f205a24abf006019fa00041bfc8cb5a138aa55 --- /dev/null +++ b/configs/dinov2/metafile.yml @@ -0,0 +1,73 @@ +Collections: + - Name: DINOv2 + Metadata: + Architecture: + - Dropout + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + Paper: + Title: 'DINOv2: Learning Robust Visual Features without Supervision' + URL: https://arxiv.org/abs/2304.07193 + README: configs/dinov2/README.md + Code: + URL: null + Version: null + +Models: + - Name: vit-small-p14_dinov2-pre_3rdparty + Metadata: + FLOPs: 46762000000 + Parameters: 22056000 + Training Data: + - LVD-142M + In Collection: DINOv2 + Results: null + Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth + Config: configs/dinov2/vit-small-p14_dinov2-pre_headless.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_pretrain.pth + Code: https://github.com/facebookresearch/dinov2 + + - Name: vit-base-p14_dinov2-pre_3rdparty + Metadata: + FLOPs: 152000000000 + Parameters: 86580000 + Training Data: + - LVD-142M + In Collection: DINOv2 + Results: null + Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth + Config: configs/dinov2/vit-base-p14_dinov2-pre_headless.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_pretrain.pth + Code: https://github.com/facebookresearch/dinov2 + + - Name: vit-large-p14_dinov2-pre_3rdparty + Metadata: + FLOPs: 507000000000 + Parameters: 304000000 + Training Data: + - LVD-142M + In Collection: DINOv2 + Results: null + Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth + Config: configs/dinov2/vit-large-p14_dinov2-pre_headless.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth + Code: https://github.com/facebookresearch/dinov2 + + - Name: vit-giant-p14_dinov2-pre_3rdparty + Metadata: + FLOPs: 1784000000000 + Parameters: 1136000000 + Training Data: + - LVD-142M + In Collection: DINOv2 + Results: null + Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth + Config: configs/dinov2/vit-giant-p14_dinov2-pre_headless.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth + Code: https://github.com/facebookresearch/dinov2 diff --git a/configs/dinov2/vit-base-p14_dinov2-pre_headless.py b/configs/dinov2/vit-base-p14_dinov2-pre_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..524dfe30bf47db1614d203097ffcfeeec5f68c1a --- /dev/null +++ b/configs/dinov2/vit-base-p14_dinov2-pre_headless.py @@ -0,0 +1,20 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + 
type='VisionTransformer', + arch='base', + img_size=518, + patch_size=14, + layer_scale_init_value=1e-5, + ), + neck=None, + head=None) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/dinov2/vit-giant-p14_dinov2-pre_headless.py b/configs/dinov2/vit-giant-p14_dinov2-pre_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..a127359e5c44b6fa99482c3720cc1555432af699 --- /dev/null +++ b/configs/dinov2/vit-giant-p14_dinov2-pre_headless.py @@ -0,0 +1,21 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='dinov2-giant', + img_size=518, + patch_size=14, + layer_scale_init_value=1e-5, + layer_cfgs=dict(ffn_type='swiglu_fused'), + ), + neck=None, + head=None) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/dinov2/vit-large-p14_dinov2-pre_headless.py b/configs/dinov2/vit-large-p14_dinov2-pre_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..4ec7bc68455520bef8986a8d563e5c732f3bf994 --- /dev/null +++ b/configs/dinov2/vit-large-p14_dinov2-pre_headless.py @@ -0,0 +1,20 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='large', + img_size=518, + patch_size=14, + layer_scale_init_value=1e-5, + ), + neck=None, + head=None) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/dinov2/vit-small-p14_dinov2-pre_headless.py b/configs/dinov2/vit-small-p14_dinov2-pre_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..198c5e51ab29be9202ac053c082366ec818e3982 --- /dev/null +++ b/configs/dinov2/vit-small-p14_dinov2-pre_headless.py @@ -0,0 +1,20 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='dinov2-small', + img_size=518, + patch_size=14, + layer_scale_init_value=1e-5, + ), + neck=None, + head=None) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/edgenext/README.md b/configs/edgenext/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1c9686f7d96183feb115f2bb6860688e48440ed8 --- /dev/null +++ b/configs/edgenext/README.md @@ -0,0 +1,80 @@ +# EdgeNeXt + +> [EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications](https://arxiv.org/abs/2206.10589) + + + +## Abstract + +In the pursuit of achieving ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient general purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture EdgeNeXt. 
Specifically in EdgeNeXt, we introduce split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and utilizes depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection and segmentation tasks, reveal the merits of the proposed approach, outperforming state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% with 28% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K. + +
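The SDTA encoder described above applies self-attention across the channel dimension rather than the spatial one, so the attention map is C x C instead of HW x HW. Below is a minimal, self-contained sketch of that "transposed" attention idea; it is an illustration only, not the EdgeNeXt implementation used by the configs in this folder.

```python
import torch

# Attention across channels: the attention map is (C x C), so its cost grows
# with the number of channels rather than with the number of pixels.
def channel_attention(x: torch.Tensor) -> torch.Tensor:
    b, c, h, w = x.shape
    tokens = x.flatten(2)                                  # (B, C, HW)
    attn = torch.softmax(
        tokens @ tokens.transpose(1, 2) / (h * w) ** 0.5,  # (B, C, C)
        dim=-1)
    return (attn @ tokens).reshape(b, c, h, w)

print(channel_attention(torch.rand(2, 64, 28, 28)).shape)  # torch.Size([2, 64, 28, 28])
```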
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('edgenext-xxsmall_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('edgenext-xxsmall_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/edgenext/edgenext-xxsmall_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :----------------------------------------------------------------------: | +| `edgenext-xxsmall_3rdparty_in1k`\* | From scratch | 1.33 | 0.26 | 71.20 | 89.91 | [config](edgenext-xxsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth) | +| `edgenext-xsmall_3rdparty_in1k`\* | From scratch | 2.34 | 0.53 | 74.86 | 92.31 | [config](edgenext-xsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xsmall_3rdparty_in1k_20220801-974f9fe7.pth) | +| `edgenext-small_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 79.41 | 94.53 | [config](edgenext-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty_in1k_20220801-d00db5f8.pth) | +| `edgenext-small-usi_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 81.06 | 95.34 | [config](edgenext-small_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty-usi_in1k_20220801-ae6d8dd3.pth) | +| `edgenext-base_3rdparty_in1k`\* | From scratch | 18.51 | 3.81 | 82.48 | 96.20 | [config](edgenext-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty_in1k_20220801-9ade408b.pth) | +| `edgenext-base_3rdparty-usi_in1k`\* | From scratch | 18.51 | 3.81 | 83.67 | 96.70 | [config](edgenext-base_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty-usi_in1k_20220801-909e8939.pth) | + +*Models with * are converted from the [official repo](https://github.com/mmaaz60/EdgeNeXt). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{Maaz2022EdgeNeXt, + title={EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications}, + author={Muhammad Maaz and Abdelrahman Shaker and Hisham Cholakkal and Salman Khan and Syed Waqas Zamir and Rao Muhammad Anwer and Fahad Shahbaz Khan}, + journal={2206.10589}, + year={2022} +} +``` diff --git a/configs/edgenext/edgenext-base_8xb256-usi_in1k.py b/configs/edgenext/edgenext-base_8xb256-usi_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..13949deaed9b09f7473fca60d4bab2012ce00c48 --- /dev/null +++ b/configs/edgenext/edgenext-base_8xb256-usi_in1k.py @@ -0,0 +1,19 @@ +_base_ = ['./edgenext-base_8xb256_in1k.py'] + +# dataset setting + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=269, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=256), + dict(type='PackInputs') +] + +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +test_dataloader = val_dataloader diff --git a/configs/edgenext/edgenext-base_8xb256_in1k.py b/configs/edgenext/edgenext-base_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..5d0a75c62fe0c771e65541937ca32b9b7ca3e9e0 --- /dev/null +++ b/configs/edgenext/edgenext-base_8xb256_in1k.py @@ -0,0 +1,20 @@ +_base_ = [ + '../_base_/models/edgenext/edgenext-base.py', + '../_base_/datasets/imagenet_bs64_edgenext_256.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=6e-3), + clip_grad=dict(max_norm=5.0), +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/edgenext/edgenext-small_8xb256-usi_in1k.py b/configs/edgenext/edgenext-small_8xb256-usi_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d6bc904be7f7e82eb3b9769260dd3559ee33e45f --- /dev/null +++ b/configs/edgenext/edgenext-small_8xb256-usi_in1k.py @@ -0,0 +1,19 @@ +_base_ = ['./edgenext-small_8xb256_in1k.py'] + +# dataset setting + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=269, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=256), + dict(type='PackInputs') +] + +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +test_dataloader = val_dataloader diff --git a/configs/edgenext/edgenext-small_8xb256_in1k.py b/configs/edgenext/edgenext-small_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f1d99bdc9f6958037306c98ba863ffb8743fa347 --- /dev/null +++ b/configs/edgenext/edgenext-small_8xb256_in1k.py @@ -0,0 +1,20 @@ +_base_ = [ + '../_base_/models/edgenext/edgenext-small.py', + '../_base_/datasets/imagenet_bs64_edgenext_256.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=6e-3), + clip_grad=dict(max_norm=5.0), +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/edgenext/edgenext-xsmall_8xb256_in1k.py b/configs/edgenext/edgenext-xsmall_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9d2326fc9deda56d1366a4ec9cafff4e4740c24c --- /dev/null +++ b/configs/edgenext/edgenext-xsmall_8xb256_in1k.py @@ -0,0 +1,20 @@ +_base_ = [ + '../_base_/models/edgenext/edgenext-xsmall.py', + '../_base_/datasets/imagenet_bs64_edgenext_256.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=6e-3), + clip_grad=dict(max_norm=5.0), +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/edgenext/edgenext-xxsmall_8xb256_in1k.py b/configs/edgenext/edgenext-xxsmall_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..507c3cb598fab10416d621e0e4cf4f78114a7918 --- /dev/null +++ b/configs/edgenext/edgenext-xxsmall_8xb256_in1k.py @@ -0,0 +1,20 @@ +_base_ = [ + '../_base_/models/edgenext/edgenext-xxsmall.py', + '../_base_/datasets/imagenet_bs64_edgenext_256.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=6e-3), + clip_grad=dict(max_norm=5.0), +) + +# runtime setting +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/edgenext/metafile.yml b/configs/edgenext/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..e69ac17405ea5081c515e8a48ff550e09675e867 --- /dev/null +++ b/configs/edgenext/metafile.yml @@ -0,0 +1,118 @@ +Collections: + - Name: EdgeNeXt + Metadata: + Training Data: ImageNet-1k + Architecture: + - SDTA + - 1x1 Convolution + - Channel Self-attention + Paper: + URL: https://arxiv.org/abs/2206.10589 + Title: 'EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications' + README: configs/edgenext/README.md + Code: + Version: v1.0.0rc1 + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.23.2/mmcls/models/backbones/edgenext.py + +Models: + - Name: edgenext-xxsmall_3rdparty_in1k + Metadata: + FLOPs: 255640144 + Parameters: 1327216 + In Collection: EdgeNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 71.20 + Top 5 Accuracy: 89.91 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth + Config: configs/edgenext/edgenext-xxsmall_8xb256_in1k.py + Converted From: + Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_xxsmall.pth + Code: https://github.com/mmaaz60/EdgeNeXt + - Name: edgenext-xsmall_3rdparty_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 529970560 + Parameters: 2336804 + In Collection: EdgeNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 74.86 + Top 5 Accuracy: 92.31 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xsmall_3rdparty_in1k_20220801-974f9fe7.pth + Config: configs/edgenext/edgenext-xsmall_8xb256_in1k.py + Converted From: + Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_xsmall.pth + Code: https://github.com/mmaaz60/EdgeNeXt + - Name: edgenext-small_3rdparty_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 1249788000 + Parameters: 5586832 + In Collection: EdgeNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.41 + Top 5 Accuracy: 94.53 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty_in1k_20220801-d00db5f8.pth + Config: configs/edgenext/edgenext-small_8xb256_in1k.py + Converted From: + Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_small.pth + Code: https://github.com/mmaaz60/EdgeNeXt + - Name: edgenext-small-usi_3rdparty_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 1249788000 + Parameters: 5586832 + In Collection: EdgeNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.06 + Top 5 Accuracy: 95.34 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty-usi_in1k_20220801-ae6d8dd3.pth + Config: configs/edgenext/edgenext-small_8xb256-usi_in1k.py + Converted From: + Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.1/edgenext_small_usi.pth + Code: https://github.com/mmaaz60/EdgeNeXt + - Name: edgenext-base_3rdparty_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 3814395280 + Parameters: 18511292 + In Collection: EdgeNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.48 + Top 5 Accuracy: 96.2 + Task: Image Classification + Weights: 
https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty_in1k_20220801-9ade408b.pth + Config: configs/edgenext/edgenext-base_8xb256_in1k.py + Converted From: + Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.2/edgenext_base.pth + Code: https://github.com/mmaaz60/EdgeNeXt + - Name: edgenext-base_3rdparty-usi_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 3814395280 + Parameters: 18511292 + In Collection: EdgeNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.67 + Top 5 Accuracy: 96.7 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty-usi_in1k_20220801-909e8939.pth + Config: configs/edgenext/edgenext-base_8xb256-usi_in1k.py + Converted From: + Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.2/edgenext_base_usi.pth + Code: https://github.com/mmaaz60/EdgeNeXt diff --git a/configs/efficientformer/README.md b/configs/efficientformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..537777efc0da6cba6aa198ab204945a1c3712688 --- /dev/null +++ b/configs/efficientformer/README.md @@ -0,0 +1,88 @@ +# EfficientFormer + +> [EfficientFormer: Vision Transformers at MobileNet Speed](https://arxiv.org/abs/2206.01191) + + + +## Abstract + +Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance. + +
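Since the claims above are all about latency, a rough way to sanity-check a converted checkpoint is to time a few forward passes. The sketch below uses the `efficientformer-l1_3rdparty_8xb128_in1k` checkpoint listed in the table further down; eager-mode timings like this are indicative only and will not match the CoreML/iPhone figures quoted in the paper.

```python
import time
import torch
from mmpretrain import get_model

# Rough eager-mode latency check on the current device; indicative only.
model = get_model('efficientformer-l1_3rdparty_8xb128_in1k', pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    for _ in range(5):            # warm-up iterations
        model(x)
    start = time.perf_counter()
    for _ in range(20):
        model(x)
elapsed = (time.perf_counter() - start) / 20
print(f'{elapsed * 1000:.1f} ms / image')
```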
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('efficientformer-l1_3rdparty_8xb128_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('efficientformer-l1_3rdparty_8xb128_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/efficientformer/efficientformer-l1_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :---------------------------------------------------------------: | +| `efficientformer-l1_3rdparty_8xb128_in1k`\* | From scratch | 12.28 | 1.30 | 80.46 | 94.99 | [config](efficientformer-l1_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth) | +| `efficientformer-l3_3rdparty_8xb128_in1k`\* | From scratch | 31.41 | 3.74 | 82.45 | 96.18 | [config](efficientformer-l3_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l3_3rdparty_in1k_20220915-466793d6.pth) | +| `efficientformer-l7_3rdparty_8xb128_in1k`\* | From scratch | 82.23 | 10.16 | 83.40 | 96.60 | [config](efficientformer-l7_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l7_3rdparty_in1k_20220915-185e30af.pth) | + +*Models with * are converted from the [official repo](https://github.com/snap-research/EfficientFormer). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@misc{https://doi.org/10.48550/arxiv.2206.01191, + doi = {10.48550/ARXIV.2206.01191}, + + url = {https://arxiv.org/abs/2206.01191}, + + author = {Li, Yanyu and Yuan, Geng and Wen, Yang and Hu, Eric and Evangelidis, Georgios and Tulyakov, Sergey and Wang, Yanzhi and Ren, Jian}, + + keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences}, + + title = {EfficientFormer: Vision Transformers at MobileNet Speed}, + + publisher = {arXiv}, + + year = {2022}, + + copyright = {Creative Commons Attribution 4.0 International} +} +``` diff --git a/configs/efficientformer/efficientformer-l1_8xb128_in1k.py b/configs/efficientformer/efficientformer-l1_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..7f55dc653eccad42dcf95d60f9aab86460ca9117 --- /dev/null +++ b/configs/efficientformer/efficientformer-l1_8xb128_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/efficientformer-l1.py', + '../_base_/datasets/imagenet_bs128_poolformer_small_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] diff --git a/configs/efficientformer/efficientformer-l3_8xb128_in1k.py b/configs/efficientformer/efficientformer-l3_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d8be5efae1ad93f175c25eabc6361a20c1ece76f --- /dev/null +++ b/configs/efficientformer/efficientformer-l3_8xb128_in1k.py @@ -0,0 +1,3 @@ +_base_ = './efficientformer-l1_8xb128_in1k.py' + +model = dict(backbone=dict(arch='l3'), head=dict(in_channels=512)) diff --git a/configs/efficientformer/efficientformer-l7_8xb128_in1k.py b/configs/efficientformer/efficientformer-l7_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..c2252652efe55840880ad64cde121a51614f4e84 --- /dev/null +++ b/configs/efficientformer/efficientformer-l7_8xb128_in1k.py @@ -0,0 +1,3 @@ +_base_ = './efficientformer-l1_8xb128_in1k.py' + +model = dict(backbone=dict(arch='l7'), head=dict(in_channels=768)) diff --git a/configs/efficientformer/metafile.yml b/configs/efficientformer/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..5c70f07ec52f956e0644d4e25d4162ed009ac72a --- /dev/null +++ b/configs/efficientformer/metafile.yml @@ -0,0 +1,67 @@ +Collections: + - Name: EfficientFormer + Metadata: + Training Data: ImageNet-1k + Architecture: + - Pooling + - 1x1 Convolution + - LayerScale + - MetaFormer + Paper: + URL: https://arxiv.org/abs/2206.01191 + Title: "EfficientFormer: Vision Transformers at MobileNet Speed" + README: configs/efficientformer/README.md + Code: + Version: v1.0.0rc1 + URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc1/configs/efficientformer/metafile.yml + +Models: + - Name: efficientformer-l1_3rdparty_8xb128_in1k + Metadata: + FLOPs: 1304601088 # 1.3G + Parameters: 12278696 # 12M + In Collection: EfficientFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.46 + Top 5 Accuracy: 94.99 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth + Config: configs/efficientformer/efficientformer-l1_8xb128_in1k.py + Converted From: + Weights: https://drive.google.com/file/d/11SbX-3cfqTOc247xKYubrAjBiUmr818y/view?usp=sharing + Code: https://github.com/snap-research/EfficientFormer + - Name: 
efficientformer-l3_3rdparty_8xb128_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 3737045760 # 3.7G + Parameters: 31406000 # 31M + In Collection: EfficientFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.45 + Top 5 Accuracy: 96.18 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l3_3rdparty_in1k_20220915-466793d6.pth + Config: configs/efficientformer/efficientformer-l3_8xb128_in1k.py + Converted From: + Weights: https://drive.google.com/file/d/1OyyjKKxDyMj-BcfInp4GlDdwLu3hc30m/view?usp=sharing + Code: https://github.com/snap-research/EfficientFormer + - Name: efficientformer-l7_3rdparty_8xb128_in1k + Metadata: + FLOPs: 10163951616 # 10.2G + Parameters: 82229328 # 82M + In Collection: EfficientFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.40 + Top 5 Accuracy: 96.60 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l7_3rdparty_in1k_20220915-185e30af.pth + Config: configs/efficientformer/efficientformer-l7_8xb128_in1k.py + Converted From: + Weights: https://drive.google.com/file/d/1cVw-pctJwgvGafeouynqWWCwgkcoFMM5/view?usp=sharing + Code: https://github.com/snap-research/EfficientFormer diff --git a/configs/efficientnet/README.md b/configs/efficientnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c7b7b76ab5db29c3f9bc54eaefffdcf9cda4c13a --- /dev/null +++ b/configs/efficientnet/README.md @@ -0,0 +1,122 @@ +# EfficientNet + +> [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946v5) + + + +## Introduction + +EfficientNets are a family of image classification models, which achieve state-of-the-art accuracy, yet being an order-of-magnitude smaller and faster than previous models. + +EfficientNets are based on AutoML and Compound Scaling. In particular, we first use [AutoML MNAS Mobile framework](https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html) to develop a mobile-size baseline network, named as EfficientNet-B0; Then, we use the compound scaling method to scale up this baseline to obtain EfficientNet-B1 to B7. + +
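The compound scaling mentioned above ties depth, width, and resolution to a single coefficient phi. The snippet below is a small numeric illustration using the coefficients reported in the EfficientNet paper (alpha=1.2, beta=1.1, gamma=1.15, chosen so that alpha * beta^2 * gamma^2 is roughly 2); the released B1-B7 models round and further tune these values.

```python
# Compound scaling: depth, width and resolution all grow with one coefficient.
alpha, beta, gamma = 1.2, 1.1, 1.15     # coefficients from the EfficientNet paper
for phi in range(5):                    # phi = 0 corresponds to EfficientNet-B0
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    resolution_mult = gamma ** phi
    print(f'phi={phi}: depth x{depth_mult:.2f}, '
          f'width x{width_mult:.2f}, resolution x{resolution_mult:.2f}')
```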
+ +
+ +## Abstract + +
+ +Click to show the detailed Abstract + +
+Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. + +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('efficientnet-b0_3rdparty_8xb32_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('efficientnet-b0_3rdparty_8xb32_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/efficientnet/efficientnet-b0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32_in1k_20220119-a7e2a0b1.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :-------------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------------: | :----------------------------------------------------: | +| `efficientnet-b0_3rdparty_8xb32_in1k`\* | From scratch | 5.29 | 0.42 | 76.74 | 93.17 | [config](efficientnet-b0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32_in1k_20220119-a7e2a0b1.pth) | +| `efficientnet-b0_3rdparty_8xb32-aa_in1k`\* | From scratch | 5.29 | 0.42 | 77.26 | 93.41 | [config](efficientnet-b0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32-aa_in1k_20220119-8d939117.pth) | +| `efficientnet-b0_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 5.29 | 0.42 | 77.53 | 93.61 | [config](efficientnet-b0_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32-aa-advprop_in1k_20220119-26434485.pth) | +| `efficientnet-b0_3rdparty-ra-noisystudent_in1k`\* | From scratch | 5.29 | 0.42 | 77.63 | 94.00 | [config](efficientnet-b0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty-ra-noisystudent_in1k_20221103-75cd08d3.pth) | +| `efficientnet-b1_3rdparty_8xb32_in1k`\* | From scratch | 7.79 | 0.74 | 78.68 | 94.28 | [config](efficientnet-b1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32_in1k_20220119-002556d9.pth) | +| `efficientnet-b1_3rdparty_8xb32-aa_in1k`\* | From scratch | 7.79 | 0.74 | 79.20 | 94.42 | [config](efficientnet-b1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32-aa_in1k_20220119-619d8ae3.pth) | +| `efficientnet-b1_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 7.79 | 0.74 | 79.52 | 94.43 | [config](efficientnet-b1_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32-aa-advprop_in1k_20220119-5715267d.pth) | +| `efficientnet-b1_3rdparty-ra-noisystudent_in1k`\* | From scratch | 7.79 | 0.74 | 81.44 | 95.83 | [config](efficientnet-b1_8xb32_in1k.py) | 
[model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty-ra-noisystudent_in1k_20221103-756bcbc0.pth) | +| `efficientnet-b2_3rdparty_8xb32_in1k`\* | From scratch | 9.11 | 1.07 | 79.64 | 94.80 | [config](efficientnet-b2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32_in1k_20220119-ea374a30.pth) | +| `efficientnet-b2_3rdparty_8xb32-aa_in1k`\* | From scratch | 9.11 | 1.07 | 80.21 | 94.96 | [config](efficientnet-b2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32-aa_in1k_20220119-dd61e80b.pth) | +| `efficientnet-b2_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 9.11 | 1.07 | 80.45 | 95.07 | [config](efficientnet-b2_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32-aa-advprop_in1k_20220119-1655338a.pth) | +| `efficientnet-b2_3rdparty-ra-noisystudent_in1k`\* | From scratch | 9.11 | 1.07 | 82.47 | 96.23 | [config](efficientnet-b2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty-ra-noisystudent_in1k_20221103-301ed299.pth) | +| `efficientnet-b3_3rdparty_8xb32_in1k`\* | From scratch | 12.23 | 1.95 | 81.01 | 95.34 | [config](efficientnet-b3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32_in1k_20220119-4b4d7487.pth) | +| `efficientnet-b3_3rdparty_8xb32-aa_in1k`\* | From scratch | 12.23 | 1.95 | 81.58 | 95.67 | [config](efficientnet-b3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32-aa_in1k_20220119-5b4887a0.pth) | +| `efficientnet-b3_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 12.23 | 1.95 | 81.81 | 95.69 | [config](efficientnet-b3_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32-aa-advprop_in1k_20220119-53b41118.pth) | +| `efficientnet-b3_3rdparty-ra-noisystudent_in1k`\* | From scratch | 12.23 | 1.95 | 84.02 | 96.89 | [config](efficientnet-b3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty-ra-noisystudent_in1k_20221103-a4ab5fd6.pth) | +| `efficientnet-b4_3rdparty_8xb32_in1k`\* | From scratch | 19.34 | 4.66 | 82.57 | 96.09 | [config](efficientnet-b4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32_in1k_20220119-81fd4077.pth) | +| `efficientnet-b4_3rdparty_8xb32-aa_in1k`\* | From scratch | 19.34 | 4.66 | 82.95 | 96.26 | [config](efficientnet-b4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32-aa_in1k_20220119-45b8bd2b.pth) | +| `efficientnet-b4_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 19.34 | 4.66 | 83.25 | 96.44 | [config](efficientnet-b4_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32-aa-advprop_in1k_20220119-38c2238c.pth) | +| `efficientnet-b4_3rdparty-ra-noisystudent_in1k`\* | From scratch | 19.34 | 4.66 | 85.25 | 97.52 | [config](efficientnet-b4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty-ra-noisystudent_in1k_20221103-16ba8a2d.pth) | +| `efficientnet-b5_3rdparty_8xb32_in1k`\* | From scratch 
| 30.39 | 10.80 | 83.18 | 96.47 | [config](efficientnet-b5_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32_in1k_20220119-e9814430.pth) | +| `efficientnet-b5_3rdparty_8xb32-aa_in1k`\* | From scratch | 30.39 | 10.80 | 83.82 | 96.76 | [config](efficientnet-b5_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32-aa_in1k_20220119-2cab8b78.pth) | +| `efficientnet-b5_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 30.39 | 10.80 | 84.21 | 96.98 | [config](efficientnet-b5_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32-aa-advprop_in1k_20220119-f57a895a.pth) | +| `efficientnet-b5_3rdparty-ra-noisystudent_in1k`\* | From scratch | 30.39 | 10.80 | 86.08 | 97.75 | [config](efficientnet-b5_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty-ra-noisystudent_in1k_20221103-111a185f.pth) | +| `efficientnet-b6_3rdparty_8xb32-aa_in1k`\* | From scratch | 43.04 | 19.97 | 84.05 | 96.82 | [config](efficientnet-b6_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty_8xb32-aa_in1k_20220119-45b03310.pth) | +| `efficientnet-b6_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 43.04 | 19.97 | 84.74 | 97.14 | [config](efficientnet-b6_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty_8xb32-aa-advprop_in1k_20220119-bfe3485e.pth) | +| `efficientnet-b6_3rdparty-ra-noisystudent_in1k`\* | From scratch | 43.04 | 19.97 | 86.47 | 97.87 | [config](efficientnet-b6_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty-ra-noisystudent_in1k_20221103-7de7d2cc.pth) | +| `efficientnet-b7_3rdparty_8xb32-aa_in1k`\* | From scratch | 66.35 | 39.32 | 84.38 | 96.88 | [config](efficientnet-b7_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty_8xb32-aa_in1k_20220119-bf03951c.pth) | +| `efficientnet-b7_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 66.35 | 39.32 | 85.14 | 97.23 | [config](efficientnet-b7_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty_8xb32-aa-advprop_in1k_20220119-c6dbff10.pth) | +| `efficientnet-b7_3rdparty-ra-noisystudent_in1k`\* | From scratch | 66.35 | 39.32 | 86.83 | 98.08 | [config](efficientnet-b7_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty-ra-noisystudent_in1k_20221103-a82894bc.pth) | +| `efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 87.41 | 65.00 | 85.38 | 97.28 | [config](efficientnet-b8_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k_20220119-297ce1b7.pth) | +| `efficientnet-l2_3rdparty-ra-noisystudent_in1k-800px`\* | From scratch | 480.31 | 174.20 | 88.33 | 98.65 | [config](efficientnet-l2_8xb8_in1k-800px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-l2_3rdparty-ra-noisystudent_in1k_20221103-be73be13.pth) | +| `efficientnet-l2_3rdparty-ra-noisystudent_in1k-475px`\* | From scratch | 480.31 | 484.98 | 88.18 | 98.55 | [config](efficientnet-l2_8xb32_in1k-475px.py) | 
[model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-l2_3rdparty-ra-noisystudent_in1k-475px_20221103-5a0d8058.pth) | + +*Models with * are converted from the [official repo](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet). The config files of these models are only for inference. We haven't reproduced the training results.* + +## Citation + +```bibtex +@inproceedings{tan2019efficientnet, + title={Efficientnet: Rethinking model scaling for convolutional neural networks}, + author={Tan, Mingxing and Le, Quoc}, + booktitle={International Conference on Machine Learning}, + pages={6105--6114}, + year={2019}, + organization={PMLR} +} +``` diff --git a/configs/efficientnet/efficientnet-b0_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b0_8xb32-01norm_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..369d0a43d1950de5da47789d0f28465c95fdaae5 --- /dev/null +++ b/configs/efficientnet/efficientnet-b0_8xb32-01norm_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../_base_/models/efficientnet_b0.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=224), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b0_8xb32_in1k.py b/configs/efficientnet/efficientnet-b0_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e4263da196430b310fae4da3273d13bb66e89075 --- /dev/null +++ b/configs/efficientnet/efficientnet-b0_8xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/efficientnet_b0.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=224), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b1_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b1_8xb32-01norm_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0405cf5f84eeedf0a2e761670bc600d9f82401af --- /dev/null +++ b/configs/efficientnet/efficientnet-b1_8xb32-01norm_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../_base_/models/efficientnet_b1.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + #
convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=240), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=240), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b1_8xb32_in1k.py b/configs/efficientnet/efficientnet-b1_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e5bf2e8076d81c97adb4d1883cfbdb5f645b6b93 --- /dev/null +++ b/configs/efficientnet/efficientnet-b1_8xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/efficientnet_b1.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=240), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=240), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b2_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b2_8xb32-01norm_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..da3f23b84c6f7fc8b5d415b90ca2f69f4d6e58c4 --- /dev/null +++ b/configs/efficientnet/efficientnet-b2_8xb32-01norm_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../_base_/models/efficientnet_b2.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=260), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=260), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b2_8xb32_in1k.py b/configs/efficientnet/efficientnet-b2_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..060a2ad3ea9247131c4207d738dce0bfacd16a16 --- /dev/null +++ b/configs/efficientnet/efficientnet-b2_8xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/efficientnet_b2.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=260), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + 
dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=260), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b3_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b3_8xb32-01norm_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..55729a9c2258352a6ed981dff25777b0acaaae85 --- /dev/null +++ b/configs/efficientnet/efficientnet-b3_8xb32-01norm_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../_base_/models/efficientnet_b3.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=300), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=300), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b3_8xb32_in1k.py b/configs/efficientnet/efficientnet-b3_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d84de5a79316ab6d7f73e45f266fbaec43ed9629 --- /dev/null +++ b/configs/efficientnet/efficientnet-b3_8xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/efficientnet_b3.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=300), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=300), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b4_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b4_8xb32-01norm_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a4dbfb212fd03d508b678a684f4d8b6854f648c6 --- /dev/null +++ b/configs/efficientnet/efficientnet-b4_8xb32-01norm_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../_base_/models/efficientnet_b4.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=380), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=380), + dict(type='PackInputs'), +] + 
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b4_8xb32_in1k.py b/configs/efficientnet/efficientnet-b4_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..08e246c3851d12ee067469d9afb10fc7f0933de7 --- /dev/null +++ b/configs/efficientnet/efficientnet-b4_8xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/efficientnet_b4.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=380), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=380), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b5_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b5_8xb32-01norm_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0c646da43d4baf23cebfc6835ec400dba6d5bd35 --- /dev/null +++ b/configs/efficientnet/efficientnet-b5_8xb32-01norm_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../_base_/models/efficientnet_b5.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=456), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=456), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b5_8xb32_in1k.py b/configs/efficientnet/efficientnet-b5_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..af4fa4b8fcbce99ae1ac163c72cec11789109482 --- /dev/null +++ b/configs/efficientnet/efficientnet-b5_8xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/efficientnet_b5.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=456), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=456), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b6_8xb32-01norm_in1k.py 
b/configs/efficientnet/efficientnet-b6_8xb32-01norm_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..dd15054928b56bdae2c3a2ef479e96826824fe2b --- /dev/null +++ b/configs/efficientnet/efficientnet-b6_8xb32-01norm_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../_base_/models/efficientnet_b6.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=528), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=528), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b6_8xb32_in1k.py b/configs/efficientnet/efficientnet-b6_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..fae02aed6dd5b8fbb1b42140856333b771c927d1 --- /dev/null +++ b/configs/efficientnet/efficientnet-b6_8xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/efficientnet_b6.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=528), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=528), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b7_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b7_8xb32-01norm_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..687dfd261d73d84061b289c955cb0260059999b2 --- /dev/null +++ b/configs/efficientnet/efficientnet-b7_8xb32-01norm_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../_base_/models/efficientnet_b7.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=600), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=600), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b7_8xb32_in1k.py b/configs/efficientnet/efficientnet-b7_8xb32_in1k.py new file mode 100644 index 
0000000000000000000000000000000000000000..5d783bb30bf1939aa1c8c9a010e5733ae7b1342b --- /dev/null +++ b/configs/efficientnet/efficientnet-b7_8xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/efficientnet_b7.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=600), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=600), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..07d3692baa9b9f3d10109e63d1da5e74cc62ee26 --- /dev/null +++ b/configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../_base_/models/efficientnet_b8.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=672), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=672), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-b8_8xb32_in1k.py b/configs/efficientnet/efficientnet-b8_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..868986f52488233b36631c13d66d8da2aac8c348 --- /dev/null +++ b/configs/efficientnet/efficientnet-b8_8xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/efficientnet_b8.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=672), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=672), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-em_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-em_8xb32-01norm_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9de3b27fb31a1382c08a646987b7cf4d996e77f4 --- /dev/null +++ b/configs/efficientnet/efficientnet-em_8xb32-01norm_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../_base_/models/efficientnet_em.py', + '../_base_/datasets/imagenet_bs32.py', + 
'../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=240), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=240), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-es_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-es_8xb32-01norm_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e643d55089b932732d47c5dbe5734c2085a2fb3e --- /dev/null +++ b/configs/efficientnet/efficientnet-es_8xb32-01norm_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/efficientnet_es.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=224), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-l2_8xb32_in1k-475px.py b/configs/efficientnet/efficientnet-l2_8xb32_in1k-475px.py new file mode 100644 index 0000000000000000000000000000000000000000..560695144f50194c00bc78707c8ddf7288e4cd52 --- /dev/null +++ b/configs/efficientnet/efficientnet-l2_8xb32_in1k-475px.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/efficientnet_l2.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=475), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=475), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/efficientnet-l2_8xb8_in1k-800px.py b/configs/efficientnet/efficientnet-l2_8xb8_in1k-800px.py new file mode 100644 index 0000000000000000000000000000000000000000..61bddfa735117db68377a224f72c1160a046ae1c --- /dev/null +++ b/configs/efficientnet/efficientnet-l2_8xb8_in1k-800px.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/efficientnet_l2.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=800), + dict(type='RandomFlip', prob=0.5, 
direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=800), + dict(type='PackInputs'), +] + +train_dataloader = dict(batch_size=8, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(batch_size=8, dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(batch_size=8, dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet/metafile.yml b/configs/efficientnet/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..21130c4ff1d64895295372acac18961a4f90bd7c --- /dev/null +++ b/configs/efficientnet/metafile.yml @@ -0,0 +1,551 @@ +Collections: + - Name: EfficientNet + Metadata: + Training Data: ImageNet-1k + Architecture: + - 1x1 Convolution + - Average Pooling + - Convolution + - Dense Connections + - Dropout + - Inverted Residual Block + - RMSProp + - Squeeze-and-Excitation Block + - Swish + Paper: + URL: https://arxiv.org/abs/1905.11946v5 + Title: "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" + README: configs/efficientnet/README.md + Code: + Version: v0.20.1 + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/efficientnet.py + +Models: + - Name: efficientnet-b0_3rdparty_8xb32_in1k + Metadata: + FLOPs: 420592480 + Parameters: 5288548 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 76.74 + Top 5 Accuracy: 93.17 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32_in1k_20220119-a7e2a0b1.pth + Config: configs/efficientnet/efficientnet-b0_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b0.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b0_3rdparty_8xb32-aa_in1k + Metadata: + FLOPs: 420592480 + Parameters: 5288548 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.26 + Top 5 Accuracy: 93.41 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32-aa_in1k_20220119-8d939117.pth + Config: configs/efficientnet/efficientnet-b0_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b0.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b0_3rdparty_8xb32-aa-advprop_in1k + Metadata: + FLOPs: 420592480 + Parameters: 5288548 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.53 + Top 5 Accuracy: 93.61 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32-aa-advprop_in1k_20220119-26434485.pth + Config: configs/efficientnet/efficientnet-b0_8xb32-01norm_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b0.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b0_3rdparty-ra-noisystudent_in1k + Metadata: + FLOPs: 420592480 + Parameters: 5288548 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.63 + Top 5 Accuracy: 94.00 + Task: Image Classification + Weights: 
https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty-ra-noisystudent_in1k_20221103-75cd08d3.pth + Config: configs/efficientnet/efficientnet-b0_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b0.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b1_3rdparty_8xb32_in1k + Metadata: + FLOPs: 744059920 + Parameters: 7794184 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.68 + Top 5 Accuracy: 94.28 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32_in1k_20220119-002556d9.pth + Config: configs/efficientnet/efficientnet-b1_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b1.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b1_3rdparty_8xb32-aa_in1k + Metadata: + FLOPs: 744059920 + Parameters: 7794184 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.20 + Top 5 Accuracy: 94.42 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32-aa_in1k_20220119-619d8ae3.pth + Config: configs/efficientnet/efficientnet-b1_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b1.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b1_3rdparty_8xb32-aa-advprop_in1k + Metadata: + FLOPs: 744059920 + Parameters: 7794184 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.52 + Top 5 Accuracy: 94.43 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32-aa-advprop_in1k_20220119-5715267d.pth + Config: configs/efficientnet/efficientnet-b1_8xb32-01norm_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b1.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b1_3rdparty-ra-noisystudent_in1k + Metadata: + FLOPs: 744059920 + Parameters: 7794184 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.44 + Top 5 Accuracy: 95.83 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty-ra-noisystudent_in1k_20221103-756bcbc0.pth + Config: configs/efficientnet/efficientnet-b1_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b1.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b2_3rdparty_8xb32_in1k + Metadata: + FLOPs: 1066620392 + Parameters: 9109994 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.64 + Top 5 Accuracy: 94.80 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32_in1k_20220119-ea374a30.pth + Config: 
configs/efficientnet/efficientnet-b2_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b2.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b2_3rdparty_8xb32-aa_in1k + Metadata: + FLOPs: 1066620392 + Parameters: 9109994 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.21 + Top 5 Accuracy: 94.96 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32-aa_in1k_20220119-dd61e80b.pth + Config: configs/efficientnet/efficientnet-b2_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b2.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b2_3rdparty_8xb32-aa-advprop_in1k + Metadata: + FLOPs: 1066620392 + Parameters: 9109994 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.45 + Top 5 Accuracy: 95.07 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32-aa-advprop_in1k_20220119-1655338a.pth + Config: configs/efficientnet/efficientnet-b2_8xb32-01norm_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b2.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b2_3rdparty-ra-noisystudent_in1k + Metadata: + FLOPs: 1066620392 + Parameters: 9109994 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.47 + Top 5 Accuracy: 96.23 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty-ra-noisystudent_in1k_20221103-301ed299.pth + Config: configs/efficientnet/efficientnet-b2_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b2.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b3_3rdparty_8xb32_in1k + Metadata: + FLOPs: 1953798216 + Parameters: 12233232 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.01 + Top 5 Accuracy: 95.34 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32_in1k_20220119-4b4d7487.pth + Config: configs/efficientnet/efficientnet-b3_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b3.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b3_3rdparty_8xb32-aa_in1k + Metadata: + FLOPs: 1953798216 + Parameters: 12233232 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.58 + Top 5 Accuracy: 95.67 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32-aa_in1k_20220119-5b4887a0.pth + Config: configs/efficientnet/efficientnet-b3_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b3.tar.gz + Code: 
https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b3_3rdparty_8xb32-aa-advprop_in1k + Metadata: + FLOPs: 1953798216 + Parameters: 12233232 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.81 + Top 5 Accuracy: 95.69 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32-aa-advprop_in1k_20220119-53b41118.pth + Config: configs/efficientnet/efficientnet-b3_8xb32-01norm_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b3.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b3_3rdparty-ra-noisystudent_in1k + Metadata: + FLOPs: 1953798216 + Parameters: 12233232 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.02 + Top 5 Accuracy: 96.89 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty-ra-noisystudent_in1k_20221103-a4ab5fd6.pth + Config: configs/efficientnet/efficientnet-b3_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b3.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b4_3rdparty_8xb32_in1k + Metadata: + FLOPs: 4659080176 + Parameters: 19341616 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.57 + Top 5 Accuracy: 96.09 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32_in1k_20220119-81fd4077.pth + Config: configs/efficientnet/efficientnet-b4_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b4.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b4_3rdparty_8xb32-aa_in1k + Metadata: + FLOPs: 4659080176 + Parameters: 19341616 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.95 + Top 5 Accuracy: 96.26 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32-aa_in1k_20220119-45b8bd2b.pth + Config: configs/efficientnet/efficientnet-b4_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b4.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b4_3rdparty_8xb32-aa-advprop_in1k + Metadata: + FLOPs: 4659080176 + Parameters: 19341616 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.25 + Top 5 Accuracy: 96.44 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32-aa-advprop_in1k_20220119-38c2238c.pth + Config: configs/efficientnet/efficientnet-b4_8xb32-01norm_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b4.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b4_3rdparty-ra-noisystudent_in1k + Metadata: + FLOPs: 4659080176 
+ Parameters: 19341616 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.25 + Top 5 Accuracy: 97.52 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty-ra-noisystudent_in1k_20221103-16ba8a2d.pth + Config: configs/efficientnet/efficientnet-b4_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b4.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b5_3rdparty_8xb32_in1k + Metadata: + FLOPs: 10799472560 + Parameters: 30389784 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.18 + Top 5 Accuracy: 96.47 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32_in1k_20220119-e9814430.pth + Config: configs/efficientnet/efficientnet-b5_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b5.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b5_3rdparty_8xb32-aa_in1k + Metadata: + FLOPs: 10799472560 + Parameters: 30389784 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.82 + Top 5 Accuracy: 96.76 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32-aa_in1k_20220119-2cab8b78.pth + Config: configs/efficientnet/efficientnet-b5_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b5.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b5_3rdparty_8xb32-aa-advprop_in1k + Metadata: + FLOPs: 10799472560 + Parameters: 30389784 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.21 + Top 5 Accuracy: 96.98 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32-aa-advprop_in1k_20220119-f57a895a.pth + Config: configs/efficientnet/efficientnet-b5_8xb32-01norm_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b5.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b5_3rdparty-ra-noisystudent_in1k + Metadata: + FLOPs: 10799472560 + Parameters: 30389784 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.08 + Top 5 Accuracy: 97.75 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty-ra-noisystudent_in1k_20221103-111a185f.pth + Config: configs/efficientnet/efficientnet-b5_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b5.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b6_3rdparty_8xb32-aa_in1k + Metadata: + FLOPs: 19971777560 + Parameters: 43040704 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.05 + Top 5 Accuracy: 96.82 + 
Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty_8xb32-aa_in1k_20220119-45b03310.pth + Config: configs/efficientnet/efficientnet-b6_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b6.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b6_3rdparty_8xb32-aa-advprop_in1k + Metadata: + FLOPs: 19971777560 + Parameters: 43040704 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.74 + Top 5 Accuracy: 97.14 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty_8xb32-aa-advprop_in1k_20220119-bfe3485e.pth + Config: configs/efficientnet/efficientnet-b6_8xb32-01norm_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b6.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b6_3rdparty-ra-noisystudent_in1k + Metadata: + FLOPs: 19971777560 + Parameters: 43040704 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.47 + Top 5 Accuracy: 97.87 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty-ra-noisystudent_in1k_20221103-7de7d2cc.pth + Config: configs/efficientnet/efficientnet-b6_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b6.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b7_3rdparty_8xb32-aa_in1k + Metadata: + FLOPs: 39316473392 + Parameters: 66347960 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.38 + Top 5 Accuracy: 96.88 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty_8xb32-aa_in1k_20220119-bf03951c.pth + Config: configs/efficientnet/efficientnet-b7_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b7.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b7_3rdparty_8xb32-aa-advprop_in1k + Metadata: + FLOPs: 39316473392 + Parameters: 66347960 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.14 + Top 5 Accuracy: 97.23 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty_8xb32-aa-advprop_in1k_20220119-c6dbff10.pth + Config: configs/efficientnet/efficientnet-b7_8xb32-01norm_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b7.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b7_3rdparty-ra-noisystudent_in1k + Metadata: + FLOPs: 39316473392 + Parameters: 66347960 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.83 + Top 5 Accuracy: 98.08 + Task: Image Classification + Weights: 
https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty-ra-noisystudent_in1k_20221103-a82894bc.pth + Config: configs/efficientnet/efficientnet-b7_8xb32_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b7.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k + Metadata: + FLOPs: 64999827816 + Parameters: 87413142 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.38 + Top 5 Accuracy: 97.28 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k_20220119-297ce1b7.pth + Config: configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b8.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-l2_3rdparty-ra-noisystudent_in1k-800px + Metadata: + FLOPs: 174203533416 + Parameters: 480309308 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 88.33 + Top 5 Accuracy: 98.65 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-l2_3rdparty-ra-noisystudent_in1k_20221103-be73be13.pth + Config: configs/efficientnet/efficientnet-l2_8xb8_in1k-800px.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-l2.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet + - Name: efficientnet-l2_3rdparty-ra-noisystudent_in1k-475px + Metadata: + FLOPs: 484984099280 + Parameters: 480309308 + In Collection: EfficientNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 88.18 + Top 5 Accuracy: 98.55 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-l2_3rdparty-ra-noisystudent_in1k-475px_20221103-5a0d8058.pth + Config: configs/efficientnet/efficientnet-l2_8xb32_in1k-475px.py + Converted From: + Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-l2_475.tar.gz + Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet diff --git a/configs/efficientnet_v2/README.md b/configs/efficientnet_v2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..965421823e7fe3e6cf8504d717864bf8a499ab2e --- /dev/null +++ b/configs/efficientnet_v2/README.md @@ -0,0 +1,98 @@ +# EfficientNetV2 + +> [EfficientNetV2: Smaller Models and Faster Training](https://arxiv.org/abs/2104.00298) + + + +## Abstract + +This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models. To develop this family of models, we use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency. The models were searched from the search space enriched with new ops such as Fused-MBConv. Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller. 
Our training can be further sped up by progressively increasing the image size during training, but it often causes a drop in accuracy. To compensate for this accuracy drop, we propose to adaptively adjust regularization (e.g., dropout and data augmentation) as well, such that we can achieve both fast training and good accuracy. With progressive learning, our EfficientNetV2 significantly outperforms previous models on ImageNet and CIFAR/Cars/Flowers datasets. By pretraining on the same ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources. Code will be available at https://github.com/google/automl/tree/master/efficientnetv2. + +
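+The progressive learning recipe described above pairs a growing image size with growing regularization, so early epochs train quickly on small, lightly regularized images and later epochs recover accuracy on large, strongly regularized ones. The sketch below only illustrates that idea; it is not an MMPreTrain or official EfficientNetV2 API, and the endpoint values (image size 128 to 300, dropout 0.1 to 0.3, RandAugment magnitude 5 to 15) are assumptions chosen for the example.
+
+```python
+# Illustrative EfficientNetV2-style progressive learning schedule.
+# The endpoint values are assumptions for the example, not the settings
+# used to train the released checkpoints.
+
+
+def progressive_stage(stage, num_stages=4, min_size=128, max_size=300,
+                      min_dropout=0.1, max_dropout=0.3,
+                      min_randaug=5, max_randaug=15):
+    """Linearly interpolate image size and regularization for one training stage."""
+    t = stage / max(num_stages - 1, 1)  # 0.0 at the first stage, 1.0 at the last
+    image_size = int(min_size + t * (max_size - min_size))
+    dropout = min_dropout + t * (max_dropout - min_dropout)
+    randaug_magnitude = min_randaug + t * (max_randaug - min_randaug)
+    return image_size, dropout, randaug_magnitude
+
+
+for stage in range(4):
+    print(progressive_stage(stage))
+```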
+ +
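+The key idea above is progressive learning with adaptive regularization: the input resolution grows over the course of training, and the regularization (dropout, RandAugment magnitude, mixup, etc.) grows with it so that larger images are trained with stronger regularization. The snippet below is only an illustrative sketch of that schedule, not the paper's exact recipe and not something wired into the configs in this folder (each of which trains at a single resolution, e.g. 192 px train / 224 px test for `b0`); the stage count and value ranges are placeholder assumptions.
+
+```python
+# Illustrative sketch of progressive learning: linearly grow the image size
+# and the regularization strength from the first stage to the last stage.
+def progressive_schedule(stage, num_stages=4,
+                         image_size=(128, 300),
+                         dropout=(0.1, 0.3),
+                         randaug_magnitude=(5, 15)):
+    t = stage / max(num_stages - 1, 1)
+
+    def lerp(lo, hi):
+        return lo + (hi - lo) * t
+
+    return dict(
+        image_size=int(lerp(*image_size)),
+        dropout=round(lerp(*dropout), 3),
+        randaug_magnitude=int(lerp(*randaug_magnitude)),
+    )
+
+
+for stage in range(4):
+    print(stage, progressive_schedule(stage))
+```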
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('efficientnetv2-b0_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('efficientnetv2-b0_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b0_3rdparty_in1k_20221221-9ef6e736.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :----------------------------------- | :--------: | :-------: | :----------------------------------------: | :-----------------------------------------------------------------------------------------------------: | +| `efficientnetv2-s_3rdparty_in21k`\* | 48.16 | 3.31 | [config](efficientnetv2-s_8xb32_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_3rdparty_in21k_20221220-c0572b56.pth) | +| `efficientnetv2-m_3rdparty_in21k`\* | 80.84 | 5.86 | [config](efficientnetv2-m_8xb32_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_3rdparty_in21k_20221220-073e944c.pth) | +| `efficientnetv2-l_3rdparty_in21k`\* | 145.22 | 13.11 | [config](efficientnetv2-l_8xb32_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_3rdparty_in21k_20221220-f28f91e1.pth) | +| `efficientnetv2-xl_3rdparty_in21k`\* | 234.82 | 18.86 | [config](efficientnetv2-xl_8xb32_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-xl_3rdparty_in21k_20221220-b2c9329c.pth) | + +*Models with * are converted from the [timm](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :-------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------------: | :---------------------------------------------------------: | +| `efficientnetv2-b0_3rdparty_in1k`\* | From scratch | 7.14 | 0.92 | 78.52 | 94.44 | [config](efficientnetv2-b0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b0_3rdparty_in1k_20221221-9ef6e736.pth) | +| `efficientnetv2-b1_3rdparty_in1k`\* | From scratch | 8.14 | 1.44 | 79.80 | 94.89 | [config](efficientnetv2-b1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b1_3rdparty_in1k_20221221-6955d9ce.pth) | +| `efficientnetv2-b2_3rdparty_in1k`\* | From scratch | 10.10 | 1.99 | 80.63 | 95.30 | [config](efficientnetv2-b2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b2_3rdparty_in1k_20221221-74f7d493.pth) | +| `efficientnetv2-b3_3rdparty_in1k`\* | From scratch | 14.36 | 3.50 | 82.03 | 95.88 | [config](efficientnetv2-b3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b3_3rdparty_in1k_20221221-b6f07a36.pth) | +| `efficientnetv2-s_3rdparty_in1k`\* | From scratch | 21.46 | 9.72 | 83.82 | 96.67 | [config](efficientnetv2-s_8xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_3rdparty_in1k_20221220-f0eaff9d.pth) | +| `efficientnetv2-m_3rdparty_in1k`\* | From scratch | 54.14 | 26.88 | 85.01 | 97.26 | [config](efficientnetv2-m_8xb32_in1k-480px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_3rdparty_in1k_20221220-9dc0c729.pth) | +| `efficientnetv2-l_3rdparty_in1k`\* | From scratch | 118.52 | 60.14 | 85.43 | 97.31 | [config](efficientnetv2-l_8xb32_in1k-480px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_3rdparty_in1k_20221220-5c3bac0f.pth) | +| `efficientnetv2-s_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 21.46 | 9.72 | 84.29 | 97.26 | [config](efficientnetv2-s_8xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_in21k-pre-3rdparty_in1k_20221220-7a7c8475.pth) | +| `efficientnetv2-m_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 54.14 | 26.88 | 85.47 | 97.76 | [config](efficientnetv2-m_8xb32_in1k-480px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_in21k-pre-3rdparty_in1k_20221220-a1013a04.pth) | +| `efficientnetv2-l_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 118.52 | 60.14 | 86.31 | 97.99 | [config](efficientnetv2-l_8xb32_in1k-480px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_in21k-pre-3rdparty_in1k_20221220-63df0efd.pth) | +| `efficientnetv2-xl_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 208.12 | 98.34 | 86.39 | 97.83 | [config](efficientnetv2-xl_8xb32_in1k-512px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-xl_in21k-pre-3rdparty_in1k_20221220-583ac18b.pth) | + +*Models with * are converted from the [timm](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py). 
The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@inproceedings{tan2021efficientnetv2, + title={Efficientnetv2: Smaller models and faster training}, + author={Tan, Mingxing and Le, Quoc}, + booktitle={International Conference on Machine Learning}, + pages={10096--10106}, + year={2021}, + organization={PMLR} +} +``` diff --git a/configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py b/configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4dc23d4904ef87f3ca581dc022a65f8d9c925038 --- /dev/null +++ b/configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py @@ -0,0 +1,58 @@ +_base_ = [ + '../_base_/models/efficientnet_v2/efficientnetv2_b0.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=192, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=224, crop_padding=0), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet_v2/efficientnetv2-b1_8xb32_in1k.py b/configs/efficientnet_v2/efficientnetv2-b1_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..fa187ff1503531732b10e2b178751361e4a4de2d --- /dev/null +++ b/configs/efficientnet_v2/efficientnetv2-b1_8xb32_in1k.py @@ -0,0 +1,21 @@ +_base_ = ['./efficientnetv2-b0_8xb32_in1k.py'] + +# model setting +model = dict(backbone=dict(arch='b1'), head=dict(in_channels=1280, )) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=192), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=240, crop_padding=0), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet_v2/efficientnetv2-b2_8xb32_in1k.py b/configs/efficientnet_v2/efficientnetv2-b2_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..3ff5530d1dbac739295c6fbc1f61fa6b36d8aa65 --- /dev/null +++ b/configs/efficientnet_v2/efficientnetv2-b2_8xb32_in1k.py @@ 
-0,0 +1,21 @@ +_base_ = ['./efficientnetv2-b0_8xb32_in1k.py'] + +# model setting +model = dict(backbone=dict(arch='b2'), head=dict(in_channels=1408, )) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=208), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=260, crop_padding=0), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet_v2/efficientnetv2-b3_8xb32_in1k.py b/configs/efficientnet_v2/efficientnetv2-b3_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..84fb29a55400a44af414b909c49806381f9564b9 --- /dev/null +++ b/configs/efficientnet_v2/efficientnetv2-b3_8xb32_in1k.py @@ -0,0 +1,21 @@ +_base_ = ['./efficientnetv2-b0_8xb32_in1k.py'] + +# model setting +model = dict(backbone=dict(arch='b3'), head=dict(in_channels=1536, )) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=240), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=300, crop_padding=0), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py b/configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py new file mode 100644 index 0000000000000000000000000000000000000000..c3606cf07086f6a8f0580183e6f94d9e1950dae3 --- /dev/null +++ b/configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py @@ -0,0 +1,23 @@ +_base_ = [ + 'efficientnetv2-s_8xb32_in1k-384px.py', +] + +# model setting +model = dict(backbone=dict(arch='l'), ) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=384, crop_padding=0), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=480, crop_padding=0), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet_v2/efficientnetv2-l_8xb32_in21k.py b/configs/efficientnet_v2/efficientnetv2-l_8xb32_in21k.py new file mode 100644 index 0000000000000000000000000000000000000000..179c72075f6f5caa4fc551fee0e3462db6dcba18 --- /dev/null +++ b/configs/efficientnet_v2/efficientnetv2-l_8xb32_in21k.py @@ -0,0 +1,4 @@ +_base_ = ['./efficientnetv2-s_8xb32_in21k.py'] + +# model setting +model = dict(backbone=dict(arch='l'), ) diff --git a/configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py b/configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py new file mode 100644 index 0000000000000000000000000000000000000000..c7bdd9be3b8e45ccb512f86049df482306ad91d9 --- /dev/null +++ b/configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py @@ -0,0 +1,23 @@ +_base_ = [ + 
'efficientnetv2-s_8xb32_in1k-384px.py', +] + +# model setting +model = dict(backbone=dict(arch='m'), ) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=384, crop_padding=0), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=480, crop_padding=0), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet_v2/efficientnetv2-m_8xb32_in21k.py b/configs/efficientnet_v2/efficientnetv2-m_8xb32_in21k.py new file mode 100644 index 0000000000000000000000000000000000000000..f04d616376aa523526425c595904e64db0214ecc --- /dev/null +++ b/configs/efficientnet_v2/efficientnetv2-m_8xb32_in21k.py @@ -0,0 +1,4 @@ +_base_ = ['./efficientnetv2-s_8xb32_in21k.py'] + +# model setting +model = dict(backbone=dict(arch='m'), ) diff --git a/configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py b/configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..2bdee636a20bf50cff4126cd50087724b7a9072f --- /dev/null +++ b/configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/models/efficientnet_v2/efficientnetv2_s.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +dataset_type = 'ImageNet' +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=300, crop_padding=0), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=384, crop_padding=0), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet_v2/efficientnetv2-s_8xb32_in21k.py b/configs/efficientnet_v2/efficientnetv2-s_8xb32_in21k.py new file mode 100644 index 0000000000000000000000000000000000000000..54f8a5af4eb92f8de1d7e5f488a8b222afda9239 --- /dev/null +++ b/configs/efficientnet_v2/efficientnetv2-s_8xb32_in21k.py @@ -0,0 +1,43 @@ +_base_ = [ + '../_base_/models/efficientnet_v2/efficientnetv2_s.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# model setting +model = dict(head=dict(num_classes=21843)) + +# dataset settings +dataset_type = 'ImageNet21k' +data_preprocessor = dict( + num_classes=21843, + # RGB format normalization parameters + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=224), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + 
dict(type='EfficientNetCenterCrop', crop_size=224, crop_padding=0), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule setting +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) diff --git a/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in1k-512px.py b/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in1k-512px.py new file mode 100644 index 0000000000000000000000000000000000000000..18f56ff063b3dd1eee15f81718cd88cd83eeb9df --- /dev/null +++ b/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in1k-512px.py @@ -0,0 +1,23 @@ +_base_ = [ + 'efficientnetv2-s_8xb32_in1k-384px.py', +] + +# model setting +model = dict(backbone=dict(arch='xl'), ) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetRandomCrop', scale=384, crop_padding=0), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=512, crop_padding=0), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in21k.py b/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in21k.py new file mode 100644 index 0000000000000000000000000000000000000000..e2ee84cb32f7b83bf6d950a92088e983063ce049 --- /dev/null +++ b/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in21k.py @@ -0,0 +1,4 @@ +_base_ = ['./efficientnetv2-s_8xb32_in21k.py'] + +# model setting +model = dict(backbone=dict(arch='xl'), ) diff --git a/configs/efficientnet_v2/metafile.yml b/configs/efficientnet_v2/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..6c927dce99ad0bf9c6e5555c4e9496e2613960d3 --- /dev/null +++ b/configs/efficientnet_v2/metafile.yml @@ -0,0 +1,255 @@ +Collections: + - Name: EfficientNetV2 + Metadata: + Training Data: ImageNet-1k + Architecture: + - 1x1 Convolution + - Average Pooling + - Convolution + - Dense Connections + - Dropout + - Inverted Residual Block + - RMSProp + - Squeeze-and-Excitation Block + - Swish + Paper: + URL: https://arxiv.org/abs/2104.00298 + Title: "EfficientNetV2: Smaller Models and Faster Training" + README: configs/efficientnet_v2/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/beit.py + Version: v1.0.0rc4 + +Models: + - Name: efficientnetv2-b0_3rdparty_in1k + Metadata: + FLOPs: 919843360 + Parameters: 7139704 + In Collection: EfficientNetV2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.52 + Top 5 Accuracy: 94.44 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b0_3rdparty_in1k_20221221-9ef6e736.pth + Config: configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_b0-c7cc451f.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-b1_3rdparty_in1k + Metadata: + FLOPs: 1438287552 + Parameters: 8141052 + In Collection: EfficientNetV2 + Results: + - Dataset: ImageNet-1k + 
Metrics: + Top 1 Accuracy: 79.80 + Top 5 Accuracy: 94.89 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b1_3rdparty_in1k_20221221-6955d9ce.pth + Config: configs/efficientnet_v2/efficientnetv2-b1_8xb32_in1k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_b1-be6e41b0.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-b2_3rdparty_in1k + Metadata: + FLOPs: 1986433080 + Parameters: 10096086 + In Collection: EfficientNetV2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.63 + Top 5 Accuracy: 95.30 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b2_3rdparty_in1k_20221221-74f7d493.pth + Config: configs/efficientnet_v2/efficientnetv2-b2_8xb32_in1k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_b2-847de54e.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-b3_3rdparty_in1k + Metadata: + FLOPs: 3498068400 + Parameters: 14358406 + In Collection: EfficientNetV2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.03 + Top 5 Accuracy: 95.88 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b3_3rdparty_in1k_20221221-b6f07a36.pth + Config: configs/efficientnet_v2/efficientnetv2-b3_8xb32_in1k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_b3-57773f13.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-s_3rdparty_in1k + Metadata: + FLOPs: 9719420928 + Parameters: 21458488 + In Collection: EfficientNetV2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.82 + Top 5 Accuracy: 96.67 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_3rdparty_in1k_20221220-f0eaff9d.pth + Config: configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_s-eb54923e.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-m_3rdparty_in1k + Metadata: + FLOPs: 26880363584 + Parameters: 54139356 + In Collection: EfficientNetV2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.01 + Top 5 Accuracy: 97.26 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_3rdparty_in1k_20221220-9dc0c729.pth + Config: configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_m-cc09e0cd.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-l_3rdparty_in1k + Metadata: + FLOPs: 60142387008 + Parameters: 118515272 + In Collection: EfficientNetV2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.43 + Top 5 Accuracy: 97.31 + Task: 
Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_3rdparty_in1k_20221220-5c3bac0f.pth + Config: configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_l-d664b728.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-s_in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 9719420928 + Parameters: 21458488 + In Collection: EfficientNetV2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.29 + Top 5 Accuracy: 97.26 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_in21k-pre-3rdparty_in1k_20221220-7a7c8475.pth + Config: configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_s_21ft1k-d7dafa41.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-m_in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 26880363584 + Parameters: 54139356 + In Collection: EfficientNetV2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.47 + Top 5 Accuracy: 97.76 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_in21k-pre-3rdparty_in1k_20221220-a1013a04.pth + Config: configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_m_21ft1k-bf41664a.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-l_in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 60142387008 + Parameters: 118515272 + In Collection: EfficientNetV2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.31 + Top 5 Accuracy: 97.99 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_in21k-pre-3rdparty_in1k_20221220-63df0efd.pth + Config: configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_l_21ft1k-60127a9d.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-xl_in21k-pre_3rdparty_in1k + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 98341230592 + Parameters: 208119808 + In Collection: EfficientNetV2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.39 + Top 5 Accuracy: 97.83 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-xl_in21k-pre-3rdparty_in1k_20221220-583ac18b.pth + Config: configs/efficientnet_v2/efficientnetv2-xl_8xb32_in1k-512px.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_xl_in21ft1k-06c35c48.pth + Code: 
https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-s_3rdparty_in21k + Metadata: + FLOPs: 3309720768 + Parameters: 48158371 + In Collection: EfficientNetV2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_3rdparty_in21k_20221220-c0572b56.pth + Config: configs/efficientnet_v2/efficientnetv2-s_8xb32_in21k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_s_21k-6337ad01.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-m_3rdparty_in21k + Metadata: + FLOPs: 5861638208 + Parameters: 80839239 + In Collection: EfficientNetV2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_3rdparty_in21k_20221220-073e944c.pth + Config: configs/efficientnet_v2/efficientnetv2-m_8xb32_in21k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_m_21k-361418a2.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-l_3rdparty_in21k + Metadata: + FLOPs: 13114950464 + Parameters: 145215155 + In Collection: EfficientNetV2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_3rdparty_in21k_20221220-f28f91e1.pth + Config: configs/efficientnet_v2/efficientnetv2-l_8xb32_in21k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_l_21k-91a19ec9.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py + - Name: efficientnetv2-xl_3rdparty_in21k + Metadata: + FLOPs: 18855244288 + Parameters: 234819691 + In Collection: EfficientNetV2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-xl_3rdparty_in21k_20221220-b2c9329c.pth + Config: configs/efficientnet_v2/efficientnetv2-xl_8xb32_in21k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_xl_in21k-fd7e8abf.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py diff --git a/configs/eva/README.md b/configs/eva/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6e49c8abe8e88bc8eb683dd6dcc0ff06faf86f5f --- /dev/null +++ b/configs/eva/README.md @@ -0,0 +1,101 @@ +# EVA + +> [EVA: Exploring the Limits of Masked Visual Representation Learning at Scale](https://arxiv.org/abs/2211.07636) + + + +## Abstract + +We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. 
Moreover, we observe quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on LVISv1.0 dataset with over a thousand categories and COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. + +
+ +
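+In this repository the objective sketched in the abstract is implemented by the `EVA` algorithm in `eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py` below, which regresses CLIP vision features (produced by a `CLIPGenerator` target generator) with a `CosineSimilarityLoss`. The toy function below is only a simplified sketch of that idea under assumed tensor shapes; it ignores the shift/scale factors used by the actual loss, and `masked_cosine_loss` is a hypothetical name, not an mmpretrain API.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def masked_cosine_loss(pred, clip_target, mask):
+    """Negative cosine similarity between predicted patch features and frozen
+    CLIP teacher features, averaged over the masked patches only.
+
+    pred / clip_target: (B, N, C) patch features; mask: (B, N) bool, True = masked.
+    """
+    pred = F.normalize(pred, dim=-1)
+    clip_target = F.normalize(clip_target, dim=-1)
+    cos = (pred * clip_target).sum(dim=-1)  # (B, N)
+    return -(cos * mask).sum() / mask.sum().clamp(min=1)
+
+
+# Toy example: 196 patches (224x224 image, 16x16 patches), 512-d CLIP features.
+B, N, C = 2, 196, 512
+mask = torch.rand(B, N) > 0.6
+print(masked_cosine_loss(torch.randn(B, N, C), torch.randn(B, N, C), mask))
+```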
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :--------------------------------------------------- | :--------: | :-------: | :-------------------------------------------------------------: | :----------------------------------------------------------------: | +| `eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k` | 111.78 | 17.58 | [config](eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.json) | +| `beit-l-p14_3rdparty-eva_in21k`\* | 303.18 | 81.08 | [config](eva-l-p14_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth) | +| `beit-l-p14_eva-pre_3rdparty_in21k`\* | 303.18 | 81.08 | [config](eva-l-p14_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in21k_20221213-8f194fa2.pth) | +| `beit-g-p16_3rdparty-eva_30m`\* | 1011.32 | 203.52 | [config](eva-g-p16_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p16_3rdparty_30m_20221213-7bed23ee.pth) | +| `beit-g-p14_3rdparty-eva_30m`\* | 1011.60 | 267.17 | [config](eva-g-p14_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_3rdparty_30m_20221213-3b7aca97.pth) | +| `beit-g-p14_eva-30m-pre_3rdparty_in21k`\* | 1011.60 | 267.17 | [config](eva-g-p14_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth) | + +*Models with * are converted from the [official repo](https://github.com/baaivision/EVA). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: | +| `vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k` | [EVA MAE STYLE](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth) | 86.57 | 17.58 | 83.70 | N/A | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.json) | +| `vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k` | [EVA MAE STYLE](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth) | 86.57 | 17.58 | 69.00 | N/A | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221226-ef51bf09.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221226-ef51bf09.json) | +| `beit-l-p14_eva-pre_3rdparty_in1k-196px`\* | [EVA](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth) | 304.14 | 61.57 | 87.94 | 98.5 | [config](eva-l-p14_8xb16_in1k-196px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-196px_20221214-2adf4d28.pth) | +| `beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px`\* | EVA ImageNet-21k | 304.14 | 61.57 | 88.58 | 98.65 | [config](eva-l-p14_8xb16_in1k-196px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-196px_20221213-b730c7e7.pth) | +| `beit-l-p14_eva-pre_3rdparty_in1k-336px`\* | [EVA](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth) | 304.53 | 191.10 | 88.66 | 98.75 | [config](eva-l-p14_8xb16_in1k-336px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-336px_20221214-07785cfd.pth) | +| `beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px`\* | EVA ImageNet-21k | 304.53 | 191.10 | 89.17 | 98.86 | [config](eva-l-p14_8xb16_in1k-336px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-336px_20221213-f25b7634.pth) | +| `beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px`\* | [EVA 30M ImageNet-21k](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth) | 1013.01 | 620.64 | 89.61 | 98.93 | [config](eva-g-p14_8xb16_in1k-336px.py) | 
[model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-336px_20221213-210f9071.pth) | +| `beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px`\* | [EVA 30M ImageNet-21k](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth) | 1014.45 | 1906.76 | 89.71 | 98.96 | [config](eva-g-p14_8xb16_in1k-560px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-560px_20221213-fa1c3652.pth) | + +*Models with * are converted from the [official repo](https://github.com/baaivision/EVA). The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{EVA, + title={EVA: Exploring the Limits of Masked Visual Representation Learning at Scale}, + author={Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue}, + journal={arXiv preprint arXiv:2211.07636}, + year={2022} +} +``` diff --git a/configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py b/configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e8a3f4983ac19208090ee63e9c9160b945b22ee6 --- /dev/null +++ b/configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py @@ -0,0 +1,114 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=0.3333333333333333, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs') +] +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + drop_path_rate=0.1, + out_type='avg_featmap', + final_norm=False, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +# optimizer wrapper +optim_wrapper = dict( + optimizer=dict( + type='AdamW', lr=4e-4, weight_decay=0.05, betas=(0.9, 0.999)), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.65, + 
custom_keys={ + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=95, + by_epoch=True, + begin=5, + end=100, + eta_min=1e-6, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=100) +default_hooks = dict( + # save checkpoint per epoch. + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py b/configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0b7333ca475ad1d9607ddda898acb623e1bd7aa4 --- /dev/null +++ b/configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py @@ -0,0 +1,70 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +train_dataloader = dict(batch_size=2048, drop_last=True) +val_dataloader = dict(drop_last=False) +test_dataloader = dict(drop_last=False) + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + frozen_stages=12, + out_type='cls_token', + final_norm=True, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=dict(type='ClsBatchNormNeck', input_features=768), + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]), + data_preprocessor=dict( + num_classes=1000, + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + to_rgb=True, + )) + +# optimizer +optim_wrapper = dict( + _delete_=True, + type='AmpOptimWrapper', + optimizer=dict(type='LARS', lr=3.2, weight_decay=0.0, momentum=0.9), +) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=90, + by_epoch=True, + begin=10, + end=100, + eta_min=0.0, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=100) + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3), + logger=dict(type='LoggerHook', interval=10)) + +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/eva/eva-g-p14_8xb16_in1k-336px.py b/configs/eva/eva-g-p14_8xb16_in1k-336px.py new file mode 100644 index 0000000000000000000000000000000000000000..aa2bd7ee5be0167c5d69d5f1cc96a069e5f17cb5 --- /dev/null +++ b/configs/eva/eva-g-p14_8xb16_in1k-336px.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/eva/eva-g.py', + '../_base_/datasets/imagenet_bs16_eva_336.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict(backbone=dict(img_size=336)) diff --git a/configs/eva/eva-g-p14_8xb16_in1k-560px.py b/configs/eva/eva-g-p14_8xb16_in1k-560px.py new file mode 100644 index 0000000000000000000000000000000000000000..ed20866b7f0dc19b919a06a71e50a205370194a0 --- /dev/null +++ 
b/configs/eva/eva-g-p14_8xb16_in1k-560px.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/eva/eva-g.py', + '../_base_/datasets/imagenet_bs16_eva_560.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict(backbone=dict(img_size=560)) diff --git a/configs/eva/eva-g-p14_headless.py b/configs/eva/eva-g-p14_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..b278aceab6211c55702c69beb1b396f37064a8b9 --- /dev/null +++ b/configs/eva/eva-g-p14_headless.py @@ -0,0 +1,24 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='BEiTViT', + arch='eva-g', + img_size=224, + patch_size=14, + layer_scale_init_value=0.0, + out_type='avg_featmap', + use_abs_pos_emb=True, + use_rel_pos_bias=False, + use_shared_rel_pos_bias=False, + ), + neck=None, + head=None, +) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/eva/eva-g-p16_headless.py b/configs/eva/eva-g-p16_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..ca5de1860f5edb0ee768eb12ce7c528fa17e2a00 --- /dev/null +++ b/configs/eva/eva-g-p16_headless.py @@ -0,0 +1,24 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='BEiTViT', + arch='eva-g', + img_size=224, + patch_size=16, + layer_scale_init_value=0.0, + out_type='avg_featmap', + use_abs_pos_emb=True, + use_rel_pos_bias=False, + use_shared_rel_pos_bias=False, + ), + neck=None, + head=None, +) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/eva/eva-l-p14_8xb16_in1k-196px.py b/configs/eva/eva-l-p14_8xb16_in1k-196px.py new file mode 100644 index 0000000000000000000000000000000000000000..3503ca5d78022e29f1c1c945aa1226085f1c3eb6 --- /dev/null +++ b/configs/eva/eva-l-p14_8xb16_in1k-196px.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/eva/eva-l.py', + '../_base_/datasets/imagenet_bs16_eva_196.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict(backbone=dict(img_size=196)) diff --git a/configs/eva/eva-l-p14_8xb16_in1k-336px.py b/configs/eva/eva-l-p14_8xb16_in1k-336px.py new file mode 100644 index 0000000000000000000000000000000000000000..7094df8ba3de0540049eaeb4693ef5b09094dc2b --- /dev/null +++ b/configs/eva/eva-l-p14_8xb16_in1k-336px.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/eva/eva-l.py', + '../_base_/datasets/imagenet_bs16_eva_336.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict(backbone=dict(img_size=336)) diff --git a/configs/eva/eva-l-p14_headless.py b/configs/eva/eva-l-p14_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..89a4ce10990489daf92e95c1355669f242838ff3 --- /dev/null +++ b/configs/eva/eva-l-p14_headless.py @@ -0,0 +1,25 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='BEiTViT', + arch='l', + img_size=224, + patch_size=14, + layer_scale_init_value=0.0, + out_type='avg_featmap', + use_abs_pos_emb=True, + use_rel_pos_bias=False, + use_shared_rel_pos_bias=False, + layer_cfgs=dict(bias=True), + 
), + neck=None, + head=None, +) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py b/configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..bbedb07c727aaa38c2de9f57fa6cfe9fdbdd87a2 --- /dev/null +++ b/configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py @@ -0,0 +1,86 @@ +_base_ = [ + '../_base_/models/mae_vit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_dataloader = dict(batch_size=256) + +# model settings +model = dict( + type='EVA', + backbone=dict(init_cfg=[ + dict(type='Xavier', distribution='uniform', layer='Linear'), + dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0) + ]), + neck=dict( + type='MAEPretrainDecoder', + predict_feature_dim=512, + init_cfg=[ + dict(type='Xavier', distribution='uniform', layer='Linear'), + dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0) + ]), + head=dict( + _delete_=True, + type='MIMHead', + loss=dict( + type='CosineSimilarityLoss', shift_factor=2.0, scale_factor=2.0), + ), + target_generator=dict( + type='CLIPGenerator', + tokenizer_path= # noqa + 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa + ), + init_cfg=None) + +# optimizer wrapper +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + 'cls_token': dict(decay_mult=0.) + })) +find_unused_parameters = True + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=360, + by_epoch=True, + begin=40, + end=400, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
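+# (Illustrative note) mmpretrain's `tools/train.py --auto-scale-lr` linearly
+# rescales the optimizer lr by (actual total batch size) / base_batch_size;
+# e.g. 8 GPUs x 256 images (2048 in total) would multiply the lr above by 0.5.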
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/eva/metafile.yml b/configs/eva/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..dd8dbbf761486532d228bbf3df5ef396b92d4880 --- /dev/null +++ b/configs/eva/metafile.yml @@ -0,0 +1,261 @@ +Collections: + - Name: EVA + Metadata: + Architecture: + - Attention Dropout + - Convolution + - Dense Connections + - Dropout + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + - Tanh Activation + Paper: + Title: 'EVA: Exploring the Limits of Masked Visual Representation Learning at + Scale' + URL: https://arxiv.org/abs/2211.07636 + README: configs/eva/README.md + Code: + URL: null + Version: null + +Models: + - Name: eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k + Metadata: + Epochs: 400 + Batch Size: 4096 + FLOPs: 17581972224 + Parameters: 111776512 + Training Data: ImageNet-1k + In Collection: EVA + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth + Config: configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py + Downstream: + - vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k + - vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k + - Name: vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 1024 + FLOPs: 17581215744 + Parameters: 86566120 + Training Data: ImageNet-1k + In Collection: EVA + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.7 + Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.pth + Config: configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py + - Name: vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 16384 + FLOPs: 17581972992 + Parameters: 86567656 + Training Data: ImageNet-1k + In Collection: EVA + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 69.0 + Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221226-ef51bf09.pth + Config: configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py + - Name: beit-l-p14_eva-pre_3rdparty_in1k-196px + Metadata: + FLOPs: 61565981696 + Parameters: 304142312 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: EVA + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 87.94 + Top 5 Accuracy: 98.5 + Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-196px_20221214-2adf4d28.pth + Config: configs/eva/eva-l-p14_8xb16_in1k-196px.py + Converted From: + Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_196px_1k_ft_88p0.pt + Code: https://github.com/baaivision/EVA + - Name: beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px + Metadata: + FLOPs: 61565981696 + Parameters: 304142312 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: EVA + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 88.58 + Top 5 Accuracy: 98.65 + Weights: 
https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-196px_20221213-b730c7e7.pth + Config: configs/eva/eva-l-p14_8xb16_in1k-196px.py + Converted From: + Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_196px_21k_to_1k_ft_88p6.pt + Code: https://github.com/baaivision/EVA + - Name: beit-l-p14_3rdparty-eva_in21k + Metadata: + FLOPs: 81075147776 + Parameters: 303178752 + Training Data: + - ImageNet-21k + In Collection: EVA + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth + Config: configs/eva/eva-l-p14_headless.py + Converted From: + Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14.pt + Code: https://github.com/baaivision/EVA + Downstream: + - beit-l-p14_eva-pre_3rdparty_in21k + - beit-l-p14_eva-pre_3rdparty_in1k-336px + - beit-l-p14_eva-pre_3rdparty_in1k-196px + - Name: beit-l-p14_eva-pre_3rdparty_in21k + Metadata: + FLOPs: 81075147776 + Parameters: 303178752 + Training Data: + - ImageNet-21k + In Collection: EVA + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in21k_20221213-8f194fa2.pth + Config: configs/eva/eva-l-p14_headless.py + Converted From: + Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_21k_ft.pt + Code: https://github.com/baaivision/EVA + - Name: beit-l-p14_eva-pre_3rdparty_in1k-336px + Metadata: + FLOPs: 191100916736 + Parameters: 304531432 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: EVA + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 88.66 + Top 5 Accuracy: 98.75 + Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-336px_20221214-07785cfd.pth + Config: configs/eva/eva-l-p14_8xb16_in1k-336px.py + Converted From: + Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_336px_1k_ft_88p65.pt + Code: https://github.com/baaivision/EVA + Downstream: + - beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px + - beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px + - Name: beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px + Metadata: + FLOPs: 191100916736 + Parameters: 304531432 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: EVA + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 89.17 + Top 5 Accuracy: 98.86 + Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-336px_20221213-f25b7634.pth + Config: configs/eva/eva-l-p14_8xb16_in1k-336px.py + Converted From: + Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_336px_21k_to_1k_ft_89p2.pt + Code: https://github.com/baaivision/EVA + - Name: beit-g-p16_3rdparty-eva_30m + Metadata: + FLOPs: 203517463424 + Parameters: 1011315072 + Training Data: + - merged-30M + In Collection: EVA + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p16_3rdparty_30m_20221213-7bed23ee.pth + Config: configs/eva/eva-g-p16_headless.py + Converted From: + Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_psz14to16.pt + Code: https://github.com/baaivision/EVA + - Name: beit-g-p14_3rdparty-eva_30m + Metadata: + FLOPs: 267174833024 + Parameters: 1011596672 + Training Data: + - merged-30M + In Collection: EVA + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_3rdparty_30m_20221213-3b7aca97.pth + Config: configs/eva/eva-g-p14_headless.py + 
Converted From: + Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_psz14.pt + Code: https://github.com/baaivision/EVA + Downstream: + - beit-g-p14_eva-30m-pre_3rdparty_in21k + - Name: beit-g-p14_eva-30m-pre_3rdparty_in21k + Metadata: + FLOPs: 267174833024 + Parameters: 1011596672 + Training Data: + - merged-30M + - ImageNet-21k + In Collection: EVA + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth + Config: configs/eva/eva-g-p14_headless.py + Converted From: + Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_224px_psz14.pt + Code: https://github.com/baaivision/EVA + Downstream: + - beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px + - beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px + - Name: beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px + Metadata: + FLOPs: 620642757504 + Parameters: 1013005672 + Training Data: + - merged-30M + - ImageNet-21k + - ImageNet-1k + In Collection: EVA + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 89.61 + Top 5 Accuracy: 98.93 + Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-336px_20221213-210f9071.pth + Config: configs/eva/eva-g-p14_8xb16_in1k-336px.py + Converted From: + Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_336px_psz14_ema_89p6.pt + Code: https://github.com/baaivision/EVA + - Name: beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px + Metadata: + FLOPs: 1906761591680 + Parameters: 1014447464 + Training Data: + - merged-30M + - ImageNet-21k + - ImageNet-1k + In Collection: EVA + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 89.71 + Top 5 Accuracy: 98.96 + Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-560px_20221213-fa1c3652.pth + Config: configs/eva/eva-g-p14_8xb16_in1k-560px.py + Converted From: + Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_560px_psz14_ema_89p7.pt + Code: https://github.com/baaivision/EVA diff --git a/configs/eva02/README.md b/configs/eva02/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bc8f64e76d1601ade6ef052a2f23f7d2f6123843 --- /dev/null +++ b/configs/eva02/README.md @@ -0,0 +1,109 @@ +# EVA-02 + +> [EVA-02: A Visual Representation for Neon Genesis](https://arxiv.org/abs/2303.11331) + + + +## Abstract + +We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 parameters and ~1/6 image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. 
To facilitate open access and open research, we release the complete suite of EVA-02 to the community. + +
+TrV builds upon the original plain ViT architecture and includes several enhancements: SwiGLU FFN, sub-LN, 2D RoPE, and JAX weight initialization. To keep the parameters & FLOPs consistent with the baseline, the FFN hidden dim of SwiGLU is 2/3× of the typical MLP counterpart. +
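The 2/3× ratio keeps the gated FFN at roughly the same cost as a standard 4× MLP, because SwiGLU needs three projection matrices instead of two. Below is a minimal PyTorch sketch of such an FFN; it only illustrates the width ratio and is not the exact `ViTEVA02` implementation used by these configs (which also applies sub-LN inside the block).

```python
import torch
import torch.nn as nn


class SwiGLUFFN(nn.Module):
    """Gated FFN: SiLU(x W1) * (x W2), projected back by W3."""

    def __init__(self, embed_dim, mlp_ratio=4 * 2 / 3):
        super().__init__()
        hidden_dim = int(embed_dim * mlp_ratio)  # 2/3 of the usual 4x width
        self.w1 = nn.Linear(embed_dim, hidden_dim)  # gate branch
        self.w2 = nn.Linear(embed_dim, hidden_dim)  # value branch
        self.w3 = nn.Linear(hidden_dim, embed_dim)  # output projection

    def forward(self, x):
        return self.w3(nn.functional.silu(self.w1(x)) * self.w2(x))


# Three matrices at (8/3)*d hidden width cost 3 * (8/3) * d^2 = 8 * d^2,
# the same as the two matrices of a plain 4x MLP (2 * 4 * d^2).
x = torch.rand(1, 197, 768)
print(SwiGLUFFN(768)(x).shape)  # torch.Size([1, 197, 768])
```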
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', pretrained=True) +inputs = torch.rand(1, 3, 336, 336) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/eva02/eva02-tiny-p14_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/eva02/eva02-tiny-p14_in1k.py /path/to/eva02-tiny-p14_in1k.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :-------------------------------- | :--------: | :-------: | :-----------------------------------: | :-----------------------------------------------------------------------------------------------------------: | +| `vit-tiny-p14_eva02-pre_in21k`\* | 5.50 | 1.70 | [config](eva02-tiny-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_pre_in21k_20230505-d703e7b1.pth) | +| `vit-small-p14_eva02-pre_in21k`\* | 21.62 | 6.14 | [config](eva02-small-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_pre_in21k_20230505-3175f463.pth) | +| `vit-base-p14_eva02-pre_in21k`\* | 85.77 | 23.22 | [config](eva02-base-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_pre_in21k_20230505-2f2d4d3c.pth) | +| `vit-large-p14_eva02-pre_in21k`\* | 303.29 | 81.15 | [config](eva02-large-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_in21k_20230505-9072de5d.pth) | +| `vit-large-p14_eva02-pre_m38m`\* | 303.29 | 81.15 | [config](eva02-large-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_m38m_20230505-b8a1a261.pth) | + +- The input size / patch size of MIM pre-trained EVA-02 is `224x224` / `14x14`. 
+ +*Models with * are converted from the [official repo](https://github.com/baaivision/EVA).* + +### Image Classification on ImageNet-1k + +#### (*w/o* IN-21K intermediate fine-tuning) + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------: | +| `vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k | 5.76 | 4.68 | 80.69 | 95.54 | [config](./eva02-tiny-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_in21k-pre_3rdparty_in1k-336px_20230505-a4e8708a.pth) | +| `vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k | 22.13 | 15.48 | 85.78 | 97.60 | [config](./eva02-small-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_in21k-pre_3rdparty_in1k-336px_20230505-9c5b0e85.pth) | +| `vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.29 | 98.53 | [config](./eva02-base-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_3rdparty_in1k-448px_20230505-8ad211c5.pth) | + +*Models with * are converted from the [official repo](https://github.com/baaivision/EVA/tree/master/EVA-02). The config files of these models are only for inference. We haven't reproduce the training results.* + +#### (*w* IN-21K intermediate fine-tuning) + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------: | +| `vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.47 | 98.62 | [config](./eva02-base-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-5cd4d87f.pth) | +| `vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 305.08 | 362.33 | 89.65 | 98.95 | [config](./eva02-large-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-926d1599.pth) | +| `vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 Merged-38M | 305.10 | 362.33 | 89.83 | 99.00 | [config](./eva02-large-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_m38m-pre_in21k-medft_3rdparty_in1k-448px_20230505-150dc5ed.pth) | + +*Models with * are converted from the [official repo](https://github.com/baaivision/EVA/tree/master/EVA-02). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{EVA-02, + title={EVA-02: A Visual Representation for Neon Genesis}, + author={Yuxin Fang and Quan Sun and Xinggang Wang and Tiejun Huang and Xinlong Wang and Yue Cao}, + journal={arXiv preprint arXiv:2303.11331}, + year={2023} +} +``` diff --git a/configs/eva02/eva02-base-p14_headless.py b/configs/eva02/eva02-base-p14_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..27aa8f8a502810d39865ee85fd45b5152c8d5269 --- /dev/null +++ b/configs/eva02/eva02-base-p14_headless.py @@ -0,0 +1,21 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='ViTEVA02', + arch='b', + img_size=224, + patch_size=14, + sub_ln=True, + final_norm=False, + out_type='avg_featmap'), + neck=None, + head=None, +) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/eva02/eva02-base-p14_in1k.py b/configs/eva02/eva02-base-p14_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..c8400d38542d71ee5d3f9713e34236bdc0e7783a --- /dev/null +++ b/configs/eva02/eva02-base-p14_in1k.py @@ -0,0 +1,32 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs16_eva_448.py', + '../_base_/schedules/imagenet_bs2048_AdamW.py', + '../_base_/default_runtime.py' +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='ViTEVA02', + arch='b', + img_size=448, + patch_size=14, + sub_ln=True, + final_norm=False, + out_type='avg_featmap'), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/eva02/eva02-large-p14_headless.py b/configs/eva02/eva02-large-p14_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..e101ac977c8590572190350292325c78477dbfd3 --- /dev/null +++ b/configs/eva02/eva02-large-p14_headless.py @@ -0,0 +1,21 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='ViTEVA02', + arch='l', + img_size=224, + patch_size=14, + sub_ln=True, + final_norm=False, + out_type='avg_featmap'), + neck=None, + head=None, +) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/eva02/eva02-large-p14_in1k.py b/configs/eva02/eva02-large-p14_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..91a42776dafd0f78ba6f3c1fbe68bfc602ad502e --- /dev/null +++ b/configs/eva02/eva02-large-p14_in1k.py @@ -0,0 +1,32 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs16_eva_448.py', + '../_base_/schedules/imagenet_bs2048_AdamW.py', + '../_base_/default_runtime.py' +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='ViTEVA02', + arch='l', + img_size=448, + patch_size=14, + sub_ln=True, + final_norm=False, + out_type='avg_featmap'), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict( + type='LabelSmoothLoss', 
label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/eva02/eva02-small-p14_headless.py b/configs/eva02/eva02-small-p14_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..a969819308e9cea449b06ae3533839d72a2b96fe --- /dev/null +++ b/configs/eva02/eva02-small-p14_headless.py @@ -0,0 +1,20 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='ViTEVA02', + arch='s', + img_size=224, + patch_size=14, + final_norm=False, + out_type='avg_featmap'), + neck=None, + head=None, +) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/eva02/eva02-small-p14_in1k.py b/configs/eva02/eva02-small-p14_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4a16d92456e39bb1147423682333cd24673133e6 --- /dev/null +++ b/configs/eva02/eva02-small-p14_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs16_eva_336.py', + '../_base_/schedules/imagenet_bs2048_AdamW.py', + '../_base_/default_runtime.py' +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='ViTEVA02', + arch='s', + img_size=336, + patch_size=14, + final_norm=False, + out_type='avg_featmap'), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/eva02/eva02-tiny-p14_headless.py b/configs/eva02/eva02-tiny-p14_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..783d0ea2ebf35df3af8072958322f4f572e36210 --- /dev/null +++ b/configs/eva02/eva02-tiny-p14_headless.py @@ -0,0 +1,20 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='ViTEVA02', + arch='t', + img_size=224, + patch_size=14, + final_norm=False, + out_type='avg_featmap'), + neck=None, + head=None, +) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255], + std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/eva02/eva02-tiny-p14_in1k.py b/configs/eva02/eva02-tiny-p14_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..84e68d7edd92d91689aa501397a9dbe3eba0b8b3 --- /dev/null +++ b/configs/eva02/eva02-tiny-p14_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs16_eva_336.py', + '../_base_/schedules/imagenet_bs2048_AdamW.py', + '../_base_/default_runtime.py' +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='ViTEVA02', + arch='t', + img_size=336, + patch_size=14, + final_norm=False, + out_type='avg_featmap'), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=192, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', 
std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) diff --git a/configs/eva02/metafile.yml b/configs/eva02/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..80acf904fb46e95f0ae52b1ff6fe3cf620cc8ae7 --- /dev/null +++ b/configs/eva02/metafile.yml @@ -0,0 +1,199 @@ +Collections: + - Name: EVA02 + Metadata: + Architecture: + - Rotary Position Embedding + - Sub Layer Normalization + - SwiGLU + Paper: + Title: 'EVA-02: A Visual Representation for Neon Genesis' + URL: https://arxiv.org/abs/2303.11331 + README: configs/eva02/README.md + +Models: + - Name: vit-tiny-p14_eva02-pre_in21k + Metadata: + FLOPs: 1703439360 + Parameters: 5504064 + Training Data: + - ImageNet-21k + In Collection: EVA02 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_pre_in21k_20230505-d703e7b1.pth + Config: configs/eva02/eva02-tiny-p14_headless.py + Converted From: + Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_Ti_pt_in21k_p14.pt + Code: https://github.com/baaivision/EVA/tree/master/EVA-02 + Downstream: + - vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px + - Name: vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px + Metadata: + FLOPs: 4675416000 + Parameters: 5758888 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: EVA02 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 80.69 + Top 5 Accuracy: 95.54 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_in21k-pre_3rdparty_in1k-336px_20230505-a4e8708a.pth + Config: configs/eva02/eva02-tiny-p14_in1k.py + Converted From: + Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_Ti_pt_in21k_ft_in1k_p14.pt + Code: https://github.com/baaivision/EVA/tree/master/EVA-02 + - Name: vit-small-p14_eva02-pre_in21k + Metadata: + FLOPs: 6135404544 + Parameters: 21624960 + Training Data: + - ImageNet-21k + In Collection: EVA02 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_pre_in21k_20230505-3175f463.pth + Config: configs/eva02/eva02-small-p14_headless.py + Converted From: + Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_S_pt_in21k_p14.pt + Code: https://github.com/baaivision/EVA/tree/master/EVA-02 + Downstream: + - vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px + - Name: vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px + Metadata: + FLOPs: 15476744064 + Parameters: 22133608 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: EVA02 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 85.78 + Top 5 Accuracy: 97.60 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_in21k-pre_3rdparty_in1k-336px_20230505-9c5b0e85.pth + Config: configs/eva02/eva02-small-p14_in1k.py + Converted From: + Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_S_pt_in21k_ft_in1k_p14.pt + Code: https://github.com/baaivision/EVA/tree/master/EVA-02 + - Name: vit-base-p14_eva02-pre_in21k + Metadata: + FLOPs: 23216492544 + Parameters: 85766400 + Training Data: + - ImageNet-21k + In Collection: EVA02 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_pre_in21k_20230505-2f2d4d3c.pth + Config: configs/eva02/eva02-base-p14_headless.py + Converted From: + Weights: 
https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_B_pt_in21k_p14.pt + Code: https://github.com/baaivision/EVA/tree/master/EVA-02 + Downstream: + - vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px + - vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px + - Name: vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px + Metadata: + FLOPs: 107105984256 + Parameters: 87126760 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: EVA02 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 88.29 + Top 5 Accuracy: 98.53 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_3rdparty_in1k-448px_20230505-8ad211c5.pth + Config: configs/eva02/eva02-base-p14_in1k.py + Converted From: + Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_B_pt_in21k_ft_in1k_p14.pt + Code: https://github.com/baaivision/EVA/tree/master/EVA-02 + - Name: vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px + Metadata: + FLOPs: 107105984256 + Parameters: 87126760 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: EVA02 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 88.47 + Top 5 Accuracy: 98.62 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-5cd4d87f.pth + Config: configs/eva02/eva02-base-p14_in1k.py + Converted From: + Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_B_pt_in21k_medft_in21k_p14.pt + Code: https://github.com/baaivision/EVA/tree/master/EVA-02 + - Name: vit-large-p14_eva02-pre_in21k + Metadata: + FLOPs: 81146703792 + Parameters: 303291328 + Training Data: + - ImageNet-21k + In Collection: EVA02 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_in21k_20230505-9072de5d.pth + Config: configs/eva02/eva02-large-p14_headless.py + Converted From: + Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_L_pt_in21k_p14.pt + Code: https://github.com/baaivision/EVA/tree/master/EVA-02 + Downstream: + - vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px + - Name: vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px + Metadata: + FLOPs: 362333836208 + Parameters: 305104808 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: EVA02 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 89.65 + Top 5 Accuracy: 98.95 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-926d1599.pth + Config: configs/eva02/eva02-large-p14_in1k.py + Converted From: + Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_L_pt_in21k_medft_in21k_p14.pt + Code: https://github.com/baaivision/EVA/tree/master/EVA-02 + - Name: vit-large-p14_eva02-pre_m38m + Metadata: + FLOPs: 81146703792 + Parameters: 303291328 + Training Data: + - Merged-38M + In Collection: EVA02 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_m38m_20230505-b8a1a261.pth + Config: configs/eva02/eva02-large-p14_headless.py + Converted From: + Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_L_pt_m38m_p14.pt + Code: https://github.com/baaivision/EVA/tree/master/EVA-02 + Downstream: + - vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px + - Name: 
vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px + Metadata: + FLOPs: 362333836208 + Parameters: 305104808 + Training Data: + - Merged-38M + - ImageNet-21k + - ImageNet-1k + In Collection: EVA02 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 89.83 + Top 5 Accuracy: 99.00 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_m38m-pre_in21k-medft_3rdparty_in1k-448px_20230505-150dc5ed.pth + Config: configs/eva02/eva02-large-p14_in1k.py + Converted From: + Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_L_pt_m38m_medft_in21k_p14.pt + Code: https://github.com/baaivision/EVA/tree/master/EVA-02 diff --git a/configs/flamingo/README.md b/configs/flamingo/README.md new file mode 100644 index 0000000000000000000000000000000000000000..60c6af0f50e43cb0f84d2a3dbd2d343a435c6310 --- /dev/null +++ b/configs/flamingo/README.md @@ -0,0 +1,82 @@ +# Flamingo + +> [Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198) + + + +## Abstract + +Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data. + +
+ +
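Concretely, few-shot prompting amounts to interleaving an image placeholder with text for each support example and ending with the template for the query image. The sketch below shows this assembly using templates in the style of the `shot_prompt_tmpl` / `final_prompt_tmpl` fields in the configs added below; the real prompt construction happens inside the `Flamingo` model class, so this is only an illustration.

```python
# Simplified illustration of how an in-context caption prompt can be built.
# '<image>' stands for the placeholder token marking where visual features
# are injected; the exact templates are defined in the Flamingo configs.
shot_prompt_tmpl = '<image>Output:{caption}<|endofchunk|>'
final_prompt_tmpl = '<image>Output:'

shots = [
    {'caption': 'A child holding a flowered umbrella and petting a yak.'},
    {'caption': 'The child is holding a brush close to his mouth.'},
]

prompt = ''.join(shot_prompt_tmpl.format(**shot) for shot in shots)
prompt += final_prompt_tmpl  # the query caption is generated after this
print(prompt)
```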
+ +## How to use it? + + + +**Use the model** + +```python +from mmpretrain import inference_model + +result = inference_model('flamingo_3rdparty-zeroshot_caption', 'demo/cat-dog.png') +print(result) +# {'pred_caption': 'A dog and a cat are looking at each other. '} +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/flamingo/flamingo_zeroshot_caption.py https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth +``` + + + +## Models and results + +### Image Caption on COCO + +| Model | Params (G) | CIDER | Config | Download | +| :------------------------------------- | :--------: | :---: | :------------------------------------: | :-----------------------------------------------------------------------------------------------------------: | +| `flamingo_3rdparty-zeroshot_caption`\* | 8.220 | 65.50 | [config](flamingo_zeroshot_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth) | + +*Models with * are converted from the [openflamingo](https://github.com/mlfoundations/open_flamingo). The config files of these models are only for inference. We haven't reproduce the training results.* + +### Visual Question Answering on VQAv2 + +| Model | Params (G) | Accuracy | Config | Download | +| :--------------------------------- | :--------: | :------: | :--------------------------------: | :----------------------------------------------------------------------------------------------------------------: | +| `flamingo_3rdparty-zeroshot_vqa`\* | 8.22 | 43.50 | [config](flamingo_zeroshot_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth) | + +*Models with * are converted from the [openflamingo](https://github.com/mlfoundations/open_flamingo). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{Alayrac2022FlamingoAV, + title={Flamingo: a Visual Language Model for Few-Shot Learning}, + author={Jean-Baptiste Alayrac and Jeff Donahue and Pauline Luc and Antoine Miech and Iain Barr and Yana Hasson and Karel Lenc and Arthur Mensch and Katie Millican and Malcolm Reynolds and Roman Ring and Eliza Rutherford and Serkan Cabi and Tengda Han and Zhitao Gong and Sina Samangooei and Marianne Monteiro and Jacob Menick and Sebastian Borgeaud and Andy Brock and Aida Nematzadeh and Sahand Sharifzadeh and Mikolaj Binkowski and Ricardo Barreira and Oriol Vinyals and Andrew Zisserman and Karen Simonyan}, + journal={ArXiv}, + year={2022}, + volume={abs/2204.14198} +} +``` + +```bibtex +@software{anas_awadalla_2023_7733589, + author = {Awadalla, Anas and Gao, Irena and Gardner, Joshua and Hessel, Jack and Hanafy, Yusuf and Zhu, Wanrong and Marathe, Kalyani and Bitton, Yonatan and Gadre, Samir and Jitsev, Jenia and Kornblith, Simon and Koh, Pang Wei and Ilharco, Gabriel and Wortsman, Mitchell and Schmidt, Ludwig}, + title = {OpenFlamingo}, + month = mar, + year = 2023, + publisher = {Zenodo}, + version = {v0.1.1}, + doi = {10.5281/zenodo.7733589}, + url = {https://doi.org/10.5281/zenodo.7733589} +} +``` diff --git a/configs/flamingo/flamingo_fewshot_caption.py b/configs/flamingo/flamingo_fewshot_caption.py new file mode 100644 index 0000000000000000000000000000000000000000..d6f9c2bfccdfb9617a14fae454af9bf209f3199a --- /dev/null +++ b/configs/flamingo/flamingo_fewshot_caption.py @@ -0,0 +1,95 @@ +_base_ = [ + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='Flamingo', + tokenizer=dict( + type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'), + vision_encoder=dict( + type='VisionTransformer', + arch='l', + patch_size=14, + pre_norm=True, + norm_cfg=dict(type='LN', eps=1e-5), + layer_cfgs=dict(act_cfg=dict(type='QuickGELU')), + final_norm=False, + out_type='raw', + pretrained=( + 'https://download.openmmlab.com/mmclassification/v0/clip/' + 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'), + ), + lang_encoder=dict( + base=dict( + type='AutoModelForCausalLM', + name_or_path='decapoda-research/llama-7b-hf', + local_files_only=True), + adapter=dict( + type='FlamingoLMAdapter', + vis_hidden_size=1024, + cross_attn_every_n_layers=4, + use_media_placement_augmentation=False), + ), + task='caption', + shot_prompt_tmpl='Output:{caption}<|endofchunk|>', + final_prompt_tmpl='Output:', + generation_cfg=dict(num_beams=3, max_new_tokens=20, length_penalty=-2.0)) + +# data settings +data_preprocessor = dict( + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +test_pipeline = [ + dict( + type='ApplyToList', + # Flamingo requires to load multiple images during few-shot inference. 
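        # `ApplyToList` applies the wrapped transforms to every path in the
        # `img_path` list (the query image plus its support shots) and
        # collects the keys listed in `collate_keys` from each result.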
+ scatter_key='img_path', + transforms=[ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=224, + interpolation='bicubic', + backend='pillow'), + dict(type='CenterCrop', crop_size=(224, 224)), + ], + collate_keys=['img', 'scale_factor', 'ori_shape'], + ), + dict( + type='PackInputs', + algorithm_keys=['gt_caption', 'shots'], + meta_keys=['image_id']), +] + +val_dataloader = dict( + batch_size=8, + num_workers=8, + dataset=dict( + type='FlamingoEvalCOCOCaption', + data_root='data/coco', + ann_file='annotations/captions_train2014.json', + data_prefix=dict(img_path='train2014'), + pipeline=test_pipeline, + num_shots=2, + num_support_examples=2048, + num_query_examples=5000, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) + +val_evaluator = dict( + type='COCOCaption', + ann_file='data/coco/annotations/captions_train2014.json') + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator + +# schedule settings +val_cfg = dict() +test_cfg = dict() diff --git a/configs/flamingo/flamingo_fewshot_vqa.py b/configs/flamingo/flamingo_fewshot_vqa.py new file mode 100644 index 0000000000000000000000000000000000000000..b85a6989b75b4cd1d7bf585cb83b40add12f104f --- /dev/null +++ b/configs/flamingo/flamingo_fewshot_vqa.py @@ -0,0 +1,109 @@ +_base_ = [ + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='Flamingo', + tokenizer=dict( + type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'), + vision_encoder=dict( + type='VisionTransformer', + arch='l', + patch_size=14, + pre_norm=True, + norm_cfg=dict(type='LN', eps=1e-5), + layer_cfgs=dict(act_cfg=dict(type='QuickGELU')), + final_norm=False, + out_type='raw', + pretrained=( + 'https://download.openmmlab.com/mmclassification/v0/clip/' + 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'), + ), + lang_encoder=dict( + base=dict( + type='AutoModelForCausalLM', + name_or_path='decapoda-research/llama-7b-hf', + local_files_only=True), + adapter=dict( + type='FlamingoLMAdapter', + vis_hidden_size=1024, + cross_attn_every_n_layers=4, + use_media_placement_augmentation=False), + ), + task='vqa', + shot_prompt_tmpl= + 'Question:{question} Short Answer:{answer}<|endofchunk|>', + final_prompt_tmpl='Question:{question} Short Answer:', + generation_cfg=dict(num_beams=3, max_new_tokens=5, length_penalty=-2.0)) + +# data settings +data_preprocessor = dict( + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +test_pipeline = [ + dict( + type='ApplyToList', + # Flamingo requires to load multiple images during few-shot inference. 
+ scatter_key='img_path', + transforms=[ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=224, + interpolation='bicubic', + backend='pillow'), + dict(type='CenterCrop', crop_size=(224, 224)), + ], + collate_keys=['img', 'scale_factor', 'ori_shape'], + ), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'], + meta_keys=['image_id']), +] + +val_dataloader = dict( + batch_size=8, + num_workers=8, + dataset=dict( + type='FlamingoEvalCOCOVQA', + data_root='data/coco', + data_prefix='val2014', + question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json', + ann_file='annotations/v2_mscoco_val2014_annotations.json', + pipeline=test_pipeline, + num_shots=2, + num_support_examples=2048, + num_query_examples=5000, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +val_evaluator = dict(type='VQAAcc') + +test_dataloader = dict( + batch_size=8, + num_workers=8, + dataset=dict( + type='FlamingoEvalCOCOVQA', + data_root='data/coco', + data_prefix='test2015', + question_file= + 'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json', + pipeline=test_pipeline, + num_shots=0, + num_support_examples=2048, + num_query_examples=5000, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json') + +# schedule settings +val_cfg = dict() +test_cfg = dict() diff --git a/configs/flamingo/flamingo_zeroshot_caption.py b/configs/flamingo/flamingo_zeroshot_caption.py new file mode 100644 index 0000000000000000000000000000000000000000..deb786e4d56e70abd26723462068dfb9ad4ed9aa --- /dev/null +++ b/configs/flamingo/flamingo_zeroshot_caption.py @@ -0,0 +1,95 @@ +_base_ = [ + '../_base_/default_runtime.py', +] + +zeroshot_prompt = ( + 'Output:A child holding a flowered umbrella and petting a yak.<|endofchunk|>' # noqa: E501 + 'Output:The child is holding a brush close to his mouth.<|endofchunk|>' # noqa: E501 +) + +# model settings +model = dict( + type='Flamingo', + tokenizer=dict( + type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'), + vision_encoder=dict( + type='VisionTransformer', + arch='l', + patch_size=14, + pre_norm=True, + norm_cfg=dict(type='LN', eps=1e-5), + layer_cfgs=dict(act_cfg=dict(type='QuickGELU')), + final_norm=False, + out_type='raw', + pretrained=( + 'https://download.openmmlab.com/mmclassification/v0/clip/' + 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'), + ), + lang_encoder=dict( + base=dict( + type='AutoModelForCausalLM', + name_or_path='decapoda-research/llama-7b-hf', + local_files_only=True), + adapter=dict( + type='FlamingoLMAdapter', + vis_hidden_size=1024, + cross_attn_every_n_layers=4, + use_media_placement_augmentation=False), + ), + task='caption', + zeroshot_prompt=zeroshot_prompt, + final_prompt_tmpl='Output:', + generation_cfg=dict(num_beams=3, max_new_tokens=20, length_penalty=-2.0), +) + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=224, + interpolation='bicubic', + backend='pillow'), + dict(type='CenterCrop', crop_size=(224, 224)), + dict( + type='PackInputs', + algorithm_keys=['gt_caption'], + meta_keys=['image_id'], + ), +] + +val_dataloader = dict( + batch_size=8, + num_workers=8, + 
dataset=dict( + type='FlamingoEvalCOCOCaption', + data_root='data/coco', + ann_file='annotations/captions_train2014.json', + data_prefix=dict(img_path='train2014'), + pipeline=test_pipeline, + num_shots=0, + num_support_examples=2048, + num_query_examples=5000, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) + +val_evaluator = dict( + type='COCOCaption', + ann_file='data/coco/annotations/captions_train2014.json') + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator + +# schedule settings +val_cfg = dict() +test_cfg = dict() diff --git a/configs/flamingo/flamingo_zeroshot_vqa.py b/configs/flamingo/flamingo_zeroshot_vqa.py new file mode 100644 index 0000000000000000000000000000000000000000..c43c7b8686679364490aa8acf893c61f4c5500f7 --- /dev/null +++ b/configs/flamingo/flamingo_zeroshot_vqa.py @@ -0,0 +1,107 @@ +_base_ = [ + '../_base_/default_runtime.py', +] + +zeroshot_prompt = ( + 'Question:What is this photo taken looking through? Short Answer:pitcher<|endofchunk|>' # noqa: E501 + 'Question:How many people are wearing shorts in the forefront of this photo? Short Answer:4<|endofchunk|>' # noqa: E501 +) + +# model settings +model = dict( + type='Flamingo', + tokenizer=dict( + type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'), + vision_encoder=dict( + type='VisionTransformer', + arch='l', + patch_size=14, + pre_norm=True, + norm_cfg=dict(type='LN', eps=1e-5), + layer_cfgs=dict(act_cfg=dict(type='QuickGELU')), + final_norm=False, + out_type='raw', + pretrained=( + 'https://download.openmmlab.com/mmclassification/v0/clip/' + 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'), + ), + lang_encoder=dict( + base=dict( + type='AutoModelForCausalLM', + name_or_path='decapoda-research/llama-7b-hf', + local_files_only=True), + adapter=dict( + type='FlamingoLMAdapter', + vis_hidden_size=1024, + cross_attn_every_n_layers=4, + use_media_placement_augmentation=False), + ), + task='vqa', + zeroshot_prompt=zeroshot_prompt, + final_prompt_tmpl='Question:{question} Short Answer:', + generation_cfg=dict(num_beams=3, max_new_tokens=5, length_penalty=-2.0)) + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=224, + interpolation='bicubic', + backend='pillow'), + dict(type='CenterCrop', crop_size=(224, 224)), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'], + meta_keys=['image_id'], + ), +] + +val_dataloader = dict( + batch_size=8, + num_workers=8, + dataset=dict( + type='FlamingoEvalCOCOVQA', + data_root='data/coco', + data_prefix='val2014', + question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json', + ann_file='annotations/v2_mscoco_val2014_annotations.json', + pipeline=test_pipeline, + num_shots=0, + num_support_examples=2048, + num_query_examples=5000, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +val_evaluator = dict(type='VQAAcc') + +test_dataloader = dict( + batch_size=8, + num_workers=8, + dataset=dict( + type='FlamingoEvalCOCOVQA', + data_root='data/coco', + data_prefix='test2015', + question_file= + 'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json', + pipeline=test_pipeline, + num_shots=0, + 
num_support_examples=2048, + num_query_examples=5000, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json') + +# schedule settings +val_cfg = dict() +test_cfg = dict() diff --git a/configs/flamingo/metafile.yml b/configs/flamingo/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..6ff33e93b24ce1e10efb57c7465e9e6663709f97 --- /dev/null +++ b/configs/flamingo/metafile.yml @@ -0,0 +1,42 @@ +Collections: + - Name: Flamingo + Metadata: + Architecture: + - Transformer + - Gated Cross-Attention Dense + Paper: + Title: 'Flamingo: a Visual Language Model for Few-Shot Learning' + URL: https://arxiv.org/abs/2204.14198 + README: configs/flamingo/README.md + +Models: + - Name: flamingo_3rdparty-zeroshot_caption + Metadata: + FLOPs: null + Parameters: 8220452880 + In Collection: Flamingo + Results: + - Task: Image Caption + Dataset: COCO + Metrics: + CIDER: 65.50 # Report from the official repo + Weights: https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth + Config: configs/flamingo/flamingo_zeroshot_caption.py + Converted From: + Weights: https://huggingface.co/openflamingo/OpenFlamingo-9B + Code: https://github.com/mlfoundations/open_flamingo + - Name: flamingo_3rdparty-zeroshot_vqa + Metadata: + FLOPs: null + Parameters: 8220452880 + In Collection: Flamingo + Results: + - Task: Visual Question Answering + Dataset: VQAv2 + Metrics: + Accuracy: 43.50 # Report from the official repo + Weights: https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth + Config: configs/flamingo/flamingo_zeroshot_vqa.py + Converted From: + Weights: https://huggingface.co/openflamingo/OpenFlamingo-9B + Code: https://github.com/mlfoundations/open_flamingo diff --git a/configs/glip/README.md b/configs/glip/README.md new file mode 100644 index 0000000000000000000000000000000000000000..48ee30560a92b8ce3c926f536f625b67cca957c2 --- /dev/null +++ b/configs/glip/README.md @@ -0,0 +1,57 @@ +# GLIP + +> [Grounded Language-Image Pre-training](https://arxiv.org/abs/2112.03857) + + + +## Abstract + +This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head. + +
+ +
+ +## How to use it? + + + +**Use the model** + +```python +import torch +from mmpretrain import get_model +model = get_model('swin-t_glip-pre_3rdparty', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + + + +## Results and models + +### Pre-trained models + +The pre-trained models are used to fine-tune, and therefore don't have evaluation results. + +| Model | Pretrain | resolution | Download | +| :------------------------------------------ | :------------------------: | :--------: | :-------------------------------------------------------------------------------------------------------------------: | +| GLIP-T (`swin-t_glip-pre_3rdparty`)\* | O365,GoldG,CC3M,SBU | 224x224 | [model](https://download.openmmlab.com/mmclassification/v1/glip/swin-t_glip-pre_3rdparty_20230413-d85813b5.pth) | +| GLIP-L (`swin-l_glip-pre_3rdparty_384px`)\* | FourODs,GoldG,CC3M+12M,SBU | 384x384 | [model](https://download.openmmlab.com/mmclassification/v1/glip/swin-l_glip-pre_3rdparty_384px_20230413-04b198e8.pth) | + +*Models with * are converted from the [official repo](https://github.com/microsoft/GLIP).* + +## Citation + +```bibtex +@inproceedings{li2021grounded, + title={Grounded Language-Image Pre-training}, + author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao}, + year={2022}, + booktitle={CVPR}, +} +``` diff --git a/configs/glip/glip-l_headless.py b/configs/glip/glip-l_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..991b6b85039bf0d24237a617dfeae285f97d7555 --- /dev/null +++ b/configs/glip/glip-l_headless.py @@ -0,0 +1,18 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformer', + arch='large', + img_size=384, + out_indices=(1, 2, 3), # original weight is for detection + stage_cfgs=dict(block_cfgs=dict(window_size=12))), + neck=None, + head=None) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[103.53, 116.28, 123.675], + std=[57.375, 57.12, 58.395], + # convert image from BGR to RGB + to_rgb=False, +) diff --git a/configs/glip/glip-t_headless.py b/configs/glip/glip-t_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..08b89f8f1e02a1d1fa230e437e6b6e3ac873821f --- /dev/null +++ b/configs/glip/glip-t_headless.py @@ -0,0 +1,18 @@ +model = dict( + type='ImageClassifier', + backbone=dict( + type='SwinTransformer', + arch='tiny', + img_size=224, + out_indices=(1, 2, 3), # original weight is for detection + ), + neck=None, + head=None) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[103.53, 116.28, 123.675], + std=[57.375, 57.12, 58.395], + # convert image from BGR to RGB + to_rgb=False, +) diff --git a/configs/glip/metafile.yml b/configs/glip/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..0691fd0d06c184082718be80d110a52dd9fae06b --- /dev/null +++ b/configs/glip/metafile.yml @@ -0,0 +1,49 @@ +Collections: + - Name: GLIP + Metadata: + Training Techniques: + - AdamW + - Weight Decay + Architecture: + - Shift Window Multihead Self Attention + Paper: + URL: https://arxiv.org/abs/2112.03857 + Title: "Grounded Language-Image Pre-training" + README: configs/glip/README.md + Code: + URL: 
https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/vit.py + Version: v1.0.0rc8 + +Models: + - Name: swin-t_glip-pre_3rdparty + In Collection: GLIP + Metadata: + FLOPs: 4508464128 + Parameters: 29056354 + Training Data: + - O365 + - GoldG + - CC3M + - SBU + Results: null + Weights: https://download.openmmlab.com/mmclassification/v1/glip/swin-t_glip-pre_3rdparty_20230413-d85813b5.pth + Converted From: + Weights: https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/glip_tiny_model_o365_goldg_cc_sbu.pth + Code: https://github.com/microsoft/GLIP + Config: configs/glip/glip-t_headless.py + - Name: swin-l_glip-pre_3rdparty_384px + In Collection: GLIP + Metadata: + FLOPs: 104080343040 + Parameters: 196735516 + Training Data: + - FourODs + - GoldG + - CC3M+12M + - SBU + Results: null + Weights: https://download.openmmlab.com/mmclassification/v1/glip/swin-l_glip-pre_3rdparty_384px_20230413-04b198e8.pth + Converted From: + Weights: https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/glip_large_model.pth + Code: https://github.com/microsoft/GLIP + Config: configs/glip/glip-l_headless.py diff --git a/configs/hivit/README.md b/configs/hivit/README.md new file mode 100644 index 0000000000000000000000000000000000000000..18ae0862c5db52a7f6f82451d398ee3e47d709ce --- /dev/null +++ b/configs/hivit/README.md @@ -0,0 +1,81 @@ +# HiViT + +> [HiViT: A Simple and More Efficient Design of Hierarchical Vision Transformer](https://arxiv.org/abs/2205.14949) + + + +## Abstract + +Recently, masked image modeling (MIM) has offered a new methodology of self-supervised pre-training of vision transformers. A key idea of efficient implementation is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), albeit hierarchical vision transformers (e.g., Swin Transformer) have potentially better properties in formulating vision inputs. In this paper, we offer a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) that enjoys both high efficiency and good performance in MIM. The key is to remove the unnecessary "local inter-unit operations", deriving structurally simple hierarchical vision transformers in which mask-units can be serialized like plain vision transformers. For this purpose, we start with Swin Transformer and (i) set the masking unit size to be the token size in the main stage of Swin Transformer, (ii) switch off inter-unit self-attentions before the main stage, and (iii) eliminate all operations after the main stage. Empirical studies demonstrate the advantageous performance of HiViT in terms of fully-supervised, self-supervised, and transfer learning. In particular, in running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B, and the performance gain generalizes to downstream tasks of detection and segmentation. Code will be made publicly available. + +
+ +
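No pretrained checkpoints are released for HiViT yet, so the usage section below only provides a training command. To sanity-check a model definition, one can still build it from the registered config name and run a random forward pass; the minimal sketch below assumes the names registered in the metafile added later in this change.

```python
import torch
from mmpretrain import get_model

# Build HiViT-Tiny from its registered config; no weights are loaded.
model = get_model('hivit-tiny-p16_16xb64_in1k', pretrained=False)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract backbone features.
feats = model.extract_feat(inputs)
print(type(feats))
```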
+ +## How to use it? + + + + + + + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/hivit/hivit-tiny-p16_16xb64_in1k.py +``` + + + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :--------------------------------------: | :------: | +| `hivit-tiny-p16_16xb64_in1k` | From scratch | 19.18 | 4.60 | 82.10 | [config](hivit-tiny-p16_16xb64_in1k.py) | N/A | +| `hivit-small-p16_16xb64_in1k` | From scratch | 37.53 | 9.07 | N/A | [config](hivit-small-p16_16xb64_in1k.py) | N/A | +| `hivit-base-p16_16xb64_in1k` | From scratch | 79.05 | 18.47 | N/A | [config](hivit-base-p16_16xb64_in1k.py) | N/A | + +## Citation + +```bibtex +@inproceedings{zhanghivit, + title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer}, + author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi}, + booktitle={International Conference on Learning Representations}, + year={2023}, +} +``` diff --git a/configs/hivit/hivit-base-p16_16xb64_in1k.py b/configs/hivit/hivit-base-p16_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d37dcda86ba8db69cea47477f240e24564fcf91f --- /dev/null +++ b/configs/hivit/hivit-base-p16_16xb64_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/hivit/base_224.py', + '../_base_/datasets/imagenet_bs64_hivit_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_hivit.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/hivit/hivit-small-p16_16xb64_in1k.py b/configs/hivit/hivit-small-p16_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4fa3976672e839354c8a215ded9a02874ab78aca --- /dev/null +++ b/configs/hivit/hivit-small-p16_16xb64_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/hivit/small_224.py', + '../_base_/datasets/imagenet_bs64_hivit_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_hivit.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/hivit/hivit-tiny-p16_16xb64_in1k.py b/configs/hivit/hivit-tiny-p16_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4ed3b6a7ae95a232995c50d26002fd6d5aa0fbe1 --- /dev/null +++ b/configs/hivit/hivit-tiny-p16_16xb64_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/hivit/tiny_224.py', + '../_base_/datasets/imagenet_bs64_hivit_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_hivit.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/hivit/metafile.yml b/configs/hivit/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..67f3a6961637a1a43f64063bdcdd567c163ab3df --- /dev/null +++ b/configs/hivit/metafile.yml @@ -0,0 +1,63 @@ +Collections: + - Name: HiViT + Metadata: + Architecture: + - Dense Connections + - Dropout + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + Paper: + Title: 'HiViT: A Simple and More Efficient Design of Hierarchical Vision Transformer' + URL: 
https://arxiv.org/abs/2205.14949 + README: configs/hivit/README.md + Code: + URL: null + Version: null + +Models: + - Name: hivit-tiny-p16_16xb64_in1k + Metadata: + FLOPs: 4603000000 + Parameters: 19181000 + Training Data: + - ImageNet-1k + In Collection: HiViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.1 + Task: Image Classification + Weights: + Config: configs/hivit/hivit-tiny-p16_16xb64_in1k.py + + - Name: hivit-small-p16_16xb64_in1k + Metadata: + FLOPs: 9072000000 + Parameters: 37526000 + Training Data: + - ImageNet-1k + In Collection: HiViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: + Task: Image Classification + Weights: + Config: configs/hivit/hivit-small-p16_16xb64_in1k.py + + - Name: hivit-base-p16_16xb64_in1k + Metadata: + FLOPs: 18474000000 + Parameters: 79051000 + Training Data: + - ImageNet-1k + In Collection: HiViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: + Task: Image Classification + Weights: + Config: configs/hivit/hivit-base-p16_16xb64_in1k.py diff --git a/configs/hornet/README.md b/configs/hornet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b4dbf05bd35d4cfc0fc165ea857110e18ace664c --- /dev/null +++ b/configs/hornet/README.md @@ -0,0 +1,80 @@ +# HorNet + +> [HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions](https://arxiv.org/abs/2207.14284) + + + +## Abstract + +Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution (g nConv) that performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable, which is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. g nConv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show HorNet outperform Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and a larger model size. Apart from the effectiveness in visual encoders, we also show g nConv can be applied to task-specific decoders and consistently improve dense prediction performance with less computation. Our results demonstrate that g nConv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet. + +
+ +
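+The recursive gated convolution can be pictured as: split the channels into groups of increasing width, run a single depthwise convolution over all of them, then gate the groups against each other recursively before a final projection. The snippet below is only a simplified sketch of that idea; the class name, the order of 3 and the 7x7 depthwise kernel are illustrative choices, not the exact `HorNet` backbone implementation in this repo.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class SimpleGnConv(nn.Module):
+    """Minimal sketch of recursive gated convolution (gnConv)."""
+
+    def __init__(self, dim, order=3):
+        super().__init__()
+        # channel widths per interaction order, e.g. [16, 32, 64] for dim=64
+        self.dims = [dim // 2**i for i in range(order)][::-1]
+        self.proj_in = nn.Conv2d(dim, 2 * dim, kernel_size=1)
+        self.dwconv = nn.Conv2d(
+            sum(self.dims), sum(self.dims), 7, padding=3, groups=sum(self.dims))
+        self.pws = nn.ModuleList(
+            nn.Conv2d(self.dims[i], self.dims[i + 1], kernel_size=1)
+            for i in range(order - 1))
+        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)
+
+    def forward(self, x):
+        gate, feat = self.proj_in(x).split([self.dims[0], sum(self.dims)], dim=1)
+        feats = self.dwconv(feat).split(self.dims, dim=1)
+        x = gate * feats[0]                 # first-order interaction
+        for pw, f in zip(self.pws, feats[1:]):
+            x = pw(x) * f                   # higher-order gated interactions
+        return self.proj_out(x)
+
+
+print(SimpleGnConv(64)(torch.rand(1, 64, 56, 56)).shape)  # (1, 64, 56, 56)
+```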
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('hornet-tiny_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('hornet-tiny_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/hornet/hornet-tiny_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny_3rdparty_in1k_20220915-0e8eedff.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :-------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-------------------------------------: | :-----------------------------------------------------------------------------: | +| `hornet-tiny_3rdparty_in1k`\* | From scratch | 22.41 | 3.98 | 82.84 | 96.24 | [config](hornet-tiny_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny_3rdparty_in1k_20220915-0e8eedff.pth) | +| `hornet-tiny-gf_3rdparty_in1k`\* | From scratch | 22.99 | 3.90 | 82.98 | 96.38 | [config](hornet-tiny-gf_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny-gf_3rdparty_in1k_20220915-4c35a66b.pth) | +| `hornet-small_3rdparty_in1k`\* | From scratch | 49.53 | 8.83 | 83.79 | 96.75 | [config](hornet-small_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-small_3rdparty_in1k_20220915-5935f60f.pth) | +| `hornet-small-gf_3rdparty_in1k`\* | From scratch | 50.40 | 8.71 | 83.98 | 96.77 | [config](hornet-small-gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-small-gf_3rdparty_in1k_20220915-649ca492.pth) | +| `hornet-base_3rdparty_in1k`\* | From scratch | 87.26 | 15.58 | 84.24 | 96.94 | [config](hornet-base_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-base_3rdparty_in1k_20220915-a06176bb.pth) | +| `hornet-base-gf_3rdparty_in1k`\* | From scratch | 88.42 | 15.42 | 84.32 | 96.95 | [config](hornet-base-gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-base-gf_3rdparty_in1k_20220915-82c06fa7.pth) | + +*Models with * are converted from the [official repo](https://github.com/raoyongming/HorNet). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{rao2022hornet, + title={HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions}, + author={Rao, Yongming and Zhao, Wenliang and Tang, Yansong and Zhou, Jie and Lim, Ser-Lam and Lu, Jiwen}, + journal={arXiv preprint arXiv:2207.14284}, + year={2022} +} +``` diff --git a/configs/hornet/hornet-base-gf_8xb64_in1k.py b/configs/hornet/hornet-base-gf_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b27012df51b4bc90303d5c30df83fb24a2d76690 --- /dev/null +++ b/configs/hornet/hornet-base-gf_8xb64_in1k.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/hornet/hornet-base-gf.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +data = dict(samples_per_gpu=64) + +optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=1.0)) + +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] diff --git a/configs/hornet/hornet-base_8xb64_in1k.py b/configs/hornet/hornet-base_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..cb78a7ddaac26bcde4032c8342de251c3c26fb68 --- /dev/null +++ b/configs/hornet/hornet-base_8xb64_in1k.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/hornet/hornet-base.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +data = dict(samples_per_gpu=64) + +optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=5.0)) + +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] diff --git a/configs/hornet/hornet-small-gf_8xb64_in1k.py b/configs/hornet/hornet-small-gf_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..96fcc77d8ca1f693f479f795e97469240f4632c3 --- /dev/null +++ b/configs/hornet/hornet-small-gf_8xb64_in1k.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/hornet/hornet-small-gf.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +data = dict(samples_per_gpu=64) + +optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=1.0)) + +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] diff --git a/configs/hornet/hornet-small_8xb64_in1k.py b/configs/hornet/hornet-small_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f0535ade00cdff0c4a25e6570a1316216f6fd37b --- /dev/null +++ b/configs/hornet/hornet-small_8xb64_in1k.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/hornet/hornet-small.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +data = dict(samples_per_gpu=64) + +optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=5.0)) + +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] diff --git a/configs/hornet/hornet-tiny-gf_8xb128_in1k.py b/configs/hornet/hornet-tiny-gf_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..3556de9c15ccb29b98fe1a7b68ee59cbbf320536 --- /dev/null +++ b/configs/hornet/hornet-tiny-gf_8xb128_in1k.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/hornet/hornet-tiny-gf.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + 
'../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +data = dict(samples_per_gpu=128) + +optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=1.0)) + +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] diff --git a/configs/hornet/hornet-tiny_8xb128_in1k.py b/configs/hornet/hornet-tiny_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..31bd1dd3fc9c4918c3043916fc155f9eb7faad1d --- /dev/null +++ b/configs/hornet/hornet-tiny_8xb128_in1k.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/hornet/hornet-tiny.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +data = dict(samples_per_gpu=128) + +optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=100.0)) + +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] diff --git a/configs/hornet/metafile.yml b/configs/hornet/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..eba0ed2f4c9ac8eb758f5f5a81d023440ae53484 --- /dev/null +++ b/configs/hornet/metafile.yml @@ -0,0 +1,115 @@ +Collections: + - Name: HorNet + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - AdamW + - Weight Decay + Architecture: + - HorNet + - gnConv + Paper: + URL: https://arxiv.org/abs/2207.14284 + Title: "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions" + README: configs/hornet/README.md + Code: + Version: v0.24.0 + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.24.0/mmcls/models/backbones/hornet.py + +Models: + - Name: hornet-tiny_3rdparty_in1k + Metadata: + FLOPs: 3976156352 # 3.98G + Parameters: 22409512 # 22.41M + In Collection: HorNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.84 + Top 5 Accuracy: 96.24 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny_3rdparty_in1k_20220915-0e8eedff.pth + Config: configs/hornet/hornet-tiny_8xb128_in1k.py + Converted From: + Code: https://github.com/raoyongming/HorNet + Weights: https://cloud.tsinghua.edu.cn/f/1ca970586c6043709a3f/?dl=1 + - Name: hornet-tiny-gf_3rdparty_in1k + Metadata: + FLOPs: 3896472160 # 3.9G + Parameters: 22991848 # 22.99M + In Collection: HorNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.98 + Top 5 Accuracy: 96.38 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny-gf_3rdparty_in1k_20220915-4c35a66b.pth + Config: configs/hornet/hornet-tiny-gf_8xb128_in1k.py + Converted From: + Code: https://github.com/raoyongming/HorNet + Weights: https://cloud.tsinghua.edu.cn/f/511faad0bde94dfcaa54/?dl=1 + - Name: hornet-small_3rdparty_in1k + Metadata: + FLOPs: 8825621280 # 8.83G + Parameters: 49528264 # 49.53M + In Collection: HorNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.79 + Top 5 Accuracy: 96.75 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-small_3rdparty_in1k_20220915-5935f60f.pth + Config: configs/hornet/hornet-small_8xb64_in1k.py + Converted From: + Code: https://github.com/raoyongming/HorNet + Weights: https://cloud.tsinghua.edu.cn/f/46422799db2941f7b684/?dl=1 + - Name: hornet-small-gf_3rdparty_in1k + Metadata: + FLOPs: 8706094992 # 8.71G + Parameters: 50401768 # 50.4M + In Collection: HorNet + Results: + - Dataset: ImageNet-1k + 
Metrics: + Top 1 Accuracy: 83.98 + Top 5 Accuracy: 96.77 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-small-gf_3rdparty_in1k_20220915-649ca492.pth + Config: configs/hornet/hornet-small-gf_8xb64_in1k.py + Converted From: + Code: https://github.com/raoyongming/HorNet + Weights: https://cloud.tsinghua.edu.cn/f/8405c984bf084d2ba85a/?dl=1 + - Name: hornet-base_3rdparty_in1k + Metadata: + FLOPs: 15582677376 # 15.59G + Parameters: 87256680 # 87.26M + In Collection: HorNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.24 + Top 5 Accuracy: 96.94 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-base_3rdparty_in1k_20220915-a06176bb.pth + Config: configs/hornet/hornet-base_8xb64_in1k.py + Converted From: + Code: https://github.com/raoyongming/HorNet + Weights: https://cloud.tsinghua.edu.cn/f/5c86cb3d655d4c17a959/?dl=1 + - Name: hornet-base-gf_3rdparty_in1k + Metadata: + FLOPs: 15423308992 # 15.42G + Parameters: 88421352 # 88.42M + In Collection: HorNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.32 + Top 5 Accuracy: 96.95 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-base-gf_3rdparty_in1k_20220915-82c06fa7.pth + Config: configs/hornet/hornet-base-gf_8xb64_in1k.py + Converted From: + Code: https://github.com/raoyongming/HorNet + Weights: https://cloud.tsinghua.edu.cn/f/6c84935e63b547f383fb/?dl=1 diff --git a/configs/hrnet/README.md b/configs/hrnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..31725cf8a4e062552fcc7a0be60562885944924c --- /dev/null +++ b/configs/hrnet/README.md @@ -0,0 +1,85 @@ +# HRNet + +> [Deep High-Resolution Representation Learning for Visual Recognition](https://arxiv.org/abs/1908.07919v2) + + + +## Abstract + +High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions *in series* (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams *in parallel*; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. + +
+ +
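+As a rough picture of the two characteristics above, the sketch below exchanges information between a high-resolution and a low-resolution stream: the low branch is projected and upsampled into the high branch, the high branch is downsampled into the low branch, and both keep their own resolution. This is a hand-written illustration with made-up channel sizes, not the `HRNet` backbone code in this repo.
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class TwoStreamFusion(nn.Module):
+    """Toy cross-resolution exchange between two parallel streams."""
+
+    def __init__(self, c_high=32, c_low=64):
+        super().__init__()
+        self.low2high = nn.Conv2d(c_low, c_high, kernel_size=1)
+        self.high2low = nn.Conv2d(c_high, c_low, kernel_size=3, stride=2, padding=1)
+
+    def forward(self, x_high, x_low):
+        up = F.interpolate(
+            self.low2high(x_low), size=x_high.shape[2:], mode='bilinear',
+            align_corners=False)
+        down = self.high2low(x_high)
+        # each stream keeps its resolution but receives the other's information
+        return F.relu(x_high + up), F.relu(x_low + down)
+
+
+y_high, y_low = TwoStreamFusion()(torch.rand(1, 32, 56, 56), torch.rand(1, 64, 28, 28))
+print(y_high.shape, y_low.shape)  # (1, 32, 56, 56) (1, 64, 28, 28)
+```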
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('hrnet-w18_3rdparty_8xb32_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('hrnet-w18_3rdparty_8xb32_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/hrnet/hrnet-w18_4xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32_in1k_20220120-0c10b180.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-------------------------------: | :------------------------------------------------------------------------------: | +| `hrnet-w18_3rdparty_8xb32_in1k`\* | From scratch | 21.30 | 4.33 | 76.75 | 93.44 | [config](hrnet-w18_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32_in1k_20220120-0c10b180.pth) | +| `hrnet-w30_3rdparty_8xb32_in1k`\* | From scratch | 37.71 | 8.17 | 78.19 | 94.22 | [config](hrnet-w30_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w30_3rdparty_8xb32_in1k_20220120-8aa3832f.pth) | +| `hrnet-w32_3rdparty_8xb32_in1k`\* | From scratch | 41.23 | 8.99 | 78.44 | 94.19 | [config](hrnet-w32_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w32_3rdparty_8xb32_in1k_20220120-c394f1ab.pth) | +| `hrnet-w40_3rdparty_8xb32_in1k`\* | From scratch | 57.55 | 12.77 | 78.94 | 94.47 | [config](hrnet-w40_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w40_3rdparty_8xb32_in1k_20220120-9a2dbfc5.pth) | +| `hrnet-w44_3rdparty_8xb32_in1k`\* | From scratch | 67.06 | 14.96 | 78.88 | 94.37 | [config](hrnet-w44_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w44_3rdparty_8xb32_in1k_20220120-35d07f73.pth) | +| `hrnet-w48_3rdparty_8xb32_in1k`\* | From scratch | 77.47 | 17.36 | 79.32 | 94.52 | [config](hrnet-w48_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w48_3rdparty_8xb32_in1k_20220120-e555ef50.pth) | +| `hrnet-w64_3rdparty_8xb32_in1k`\* | From scratch | 128.06 | 29.00 | 79.46 | 94.65 | [config](hrnet-w64_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w64_3rdparty_8xb32_in1k_20220120-19126642.pth) | +| `hrnet-w18_3rdparty_8xb32-ssld_in1k`\* | From scratch | 21.30 | 4.33 | 81.06 | 95.70 | [config](hrnet-w18_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32-ssld_in1k_20220120-455f69ea.pth) | +| `hrnet-w48_3rdparty_8xb32-ssld_in1k`\* | From scratch | 77.47 | 17.36 | 83.63 | 96.79 | [config](hrnet-w48_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w48_3rdparty_8xb32-ssld_in1k_20220120-d0459c38.pth) | + +*Models with * are converted from the [official 
repo](https://github.com/HRNet/HRNet-Image-Classification). The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{WangSCJDZLMTWLX19, + title={Deep High-Resolution Representation Learning for Visual Recognition}, + author={Jingdong Wang and Ke Sun and Tianheng Cheng and + Borui Jiang and Chaorui Deng and Yang Zhao and Dong Liu and Yadong Mu and + Mingkui Tan and Xinggang Wang and Wenyu Liu and Bin Xiao}, + journal={TPAMI}, + year={2019} +} +``` diff --git a/configs/hrnet/hrnet-w18_4xb32_in1k.py b/configs/hrnet/hrnet-w18_4xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..3bc329a7e050131b01305d0209cc087c8f2daa24 --- /dev/null +++ b/configs/hrnet/hrnet-w18_4xb32_in1k.py @@ -0,0 +1,11 @@ +_base_ = [ + '../_base_/models/hrnet/hrnet-w18.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (4 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=128) diff --git a/configs/hrnet/hrnet-w30_4xb32_in1k.py b/configs/hrnet/hrnet-w30_4xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..669a66b8cc7af8b8b394dba3f915f184e3b9d28f --- /dev/null +++ b/configs/hrnet/hrnet-w30_4xb32_in1k.py @@ -0,0 +1,11 @@ +_base_ = [ + '../_base_/models/hrnet/hrnet-w30.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (4 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=128) diff --git a/configs/hrnet/hrnet-w32_4xb32_in1k.py b/configs/hrnet/hrnet-w32_4xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..1e487403ffd242f4886962237a5bbfd57d6bbd62 --- /dev/null +++ b/configs/hrnet/hrnet-w32_4xb32_in1k.py @@ -0,0 +1,11 @@ +_base_ = [ + '../_base_/models/hrnet/hrnet-w32.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (4 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=128) diff --git a/configs/hrnet/hrnet-w40_4xb32_in1k.py b/configs/hrnet/hrnet-w40_4xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..1866a2a2b93d49164ebc8892342d11781a1ba9a5 --- /dev/null +++ b/configs/hrnet/hrnet-w40_4xb32_in1k.py @@ -0,0 +1,11 @@ +_base_ = [ + '../_base_/models/hrnet/hrnet-w40.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (4 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=128) diff --git a/configs/hrnet/hrnet-w44_4xb32_in1k.py b/configs/hrnet/hrnet-w44_4xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4ec913f7188151ea913f7ba324dc31845b1e9c11 --- /dev/null +++ b/configs/hrnet/hrnet-w44_4xb32_in1k.py @@ -0,0 +1,11 @@ +_base_ = [ + '../_base_/models/hrnet/hrnet-w44.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (4 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=128) diff --git a/configs/hrnet/hrnet-w48_4xb32_in1k.py b/configs/hrnet/hrnet-w48_4xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0fc3f18ff03fafba4ff24d510546b6b0434c76c4 --- /dev/null +++ b/configs/hrnet/hrnet-w48_4xb32_in1k.py @@ -0,0 +1,11 @@ +_base_ = [ + '../_base_/models/hrnet/hrnet-w48.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (4 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=128) diff --git a/configs/hrnet/hrnet-w64_4xb32_in1k.py b/configs/hrnet/hrnet-w64_4xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..659b3cd23ef16d953dc181d83016f955cd1570e0 --- /dev/null +++ b/configs/hrnet/hrnet-w64_4xb32_in1k.py @@ -0,0 +1,11 @@ +_base_ = [ + '../_base_/models/hrnet/hrnet-w64.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (4 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=128) diff --git a/configs/hrnet/metafile.yml b/configs/hrnet/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..3a17b1251333c17b3b1c7834b46d15b4c43b8bd3 --- /dev/null +++ b/configs/hrnet/metafile.yml @@ -0,0 +1,162 @@ +Collections: + - Name: HRNet + Metadata: + Training Data: ImageNet-1k + Architecture: + - Batch Normalization + - Convolution + - ReLU + - Residual Connection + Paper: + URL: https://arxiv.org/abs/1908.07919v2 + Title: "Deep High-Resolution Representation Learning for Visual Recognition" + README: configs/hrnet/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/hrnet.py + Version: v0.20.1 + +Models: + - Name: hrnet-w18_3rdparty_8xb32_in1k + Metadata: + FLOPs: 4330397932 + Parameters: 21295164 + In Collection: HRNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 76.75 + Top 5 Accuracy: 93.44 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32_in1k_20220120-0c10b180.pth + Config: configs/hrnet/hrnet-w18_4xb32_in1k.py + Converted From: + Weights: https://1drv.ms/u/s!Aus8VCZ_C_33cMkPimlmClRvmpw + Code: https://github.com/HRNet/HRNet-Image-Classification + - Name: hrnet-w30_3rdparty_8xb32_in1k + Metadata: + FLOPs: 8168305684 + Parameters: 37708380 + In Collection: HRNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.19 + Top 5 Accuracy: 94.22 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w30_3rdparty_8xb32_in1k_20220120-8aa3832f.pth + Config: configs/hrnet/hrnet-w30_4xb32_in1k.py + Converted From: + Weights: https://1drv.ms/u/s!Aus8VCZ_C_33cQoACCEfrzcSaVI + Code: https://github.com/HRNet/HRNet-Image-Classification + - Name: hrnet-w32_3rdparty_8xb32_in1k + Metadata: + FLOPs: 8986267584 + Parameters: 41228840 + In Collection: HRNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.44 + Top 5 Accuracy: 94.19 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w32_3rdparty_8xb32_in1k_20220120-c394f1ab.pth + Config: configs/hrnet/hrnet-w32_4xb32_in1k.py + Converted From: + Weights: https://1drv.ms/u/s!Aus8VCZ_C_33dYBMemi9xOUFR0w + Code: https://github.com/HRNet/HRNet-Image-Classification + - Name: hrnet-w40_3rdparty_8xb32_in1k + Metadata: + FLOPs: 12767574064 + Parameters: 57553320 + In Collection: HRNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.94 + Top 5 Accuracy: 94.47 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w40_3rdparty_8xb32_in1k_20220120-9a2dbfc5.pth + Config: configs/hrnet/hrnet-w40_4xb32_in1k.py + Converted From: + Weights: https://1drv.ms/u/s!Aus8VCZ_C_33ck0gvo5jfoWBOPo + Code: https://github.com/HRNet/HRNet-Image-Classification + - Name: hrnet-w44_3rdparty_8xb32_in1k + Metadata: + FLOPs: 14963902632 + Parameters: 67061144 + In Collection: HRNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.88 + Top 5 Accuracy: 94.37 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w44_3rdparty_8xb32_in1k_20220120-35d07f73.pth + Config: configs/hrnet/hrnet-w44_4xb32_in1k.py + Converted From: + Weights: https://1drv.ms/u/s!Aus8VCZ_C_33czZQ0woUb980gRs + Code: https://github.com/HRNet/HRNet-Image-Classification + - Name: 
hrnet-w48_3rdparty_8xb32_in1k + Metadata: + FLOPs: 17364014752 + Parameters: 77466024 + In Collection: HRNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.32 + Top 5 Accuracy: 94.52 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w48_3rdparty_8xb32_in1k_20220120-e555ef50.pth + Config: configs/hrnet/hrnet-w48_4xb32_in1k.py + Converted From: + Weights: https://1drv.ms/u/s!Aus8VCZ_C_33dKvqI6pBZlifgJk + Code: https://github.com/HRNet/HRNet-Image-Classification + - Name: hrnet-w64_3rdparty_8xb32_in1k + Metadata: + FLOPs: 29002298752 + Parameters: 128056104 + In Collection: HRNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.46 + Top 5 Accuracy: 94.65 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w64_3rdparty_8xb32_in1k_20220120-19126642.pth + Config: configs/hrnet/hrnet-w64_4xb32_in1k.py + Converted From: + Weights: https://1drv.ms/u/s!Aus8VCZ_C_33gQbJsUPTIj3rQu99 + Code: https://github.com/HRNet/HRNet-Image-Classification + - Name: hrnet-w18_3rdparty_8xb32-ssld_in1k + Metadata: + FLOPs: 4330397932 + Parameters: 21295164 + In Collection: HRNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.06 + Top 5 Accuracy: 95.7 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32-ssld_in1k_20220120-455f69ea.pth + Config: configs/hrnet/hrnet-w18_4xb32_in1k.py + Converted From: + Weights: https://github.com/HRNet/HRNet-Image-Classification/releases/download/PretrainedWeights/HRNet_W18_C_ssld_pretrained.pth + Code: https://github.com/HRNet/HRNet-Image-Classification + - Name: hrnet-w48_3rdparty_8xb32-ssld_in1k + Metadata: + FLOPs: 17364014752 + Parameters: 77466024 + In Collection: HRNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.63 + Top 5 Accuracy: 96.79 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w48_3rdparty_8xb32-ssld_in1k_20220120-d0459c38.pth + Config: configs/hrnet/hrnet-w48_4xb32_in1k.py + Converted From: + Weights: https://github.com/HRNet/HRNet-Image-Classification/releases/download/PretrainedWeights/HRNet_W48_C_ssld_pretrained.pth + Code: https://github.com/HRNet/HRNet-Image-Classification diff --git a/configs/inception_v3/README.md b/configs/inception_v3/README.md new file mode 100644 index 0000000000000000000000000000000000000000..24fde38118de66a642938d4d23f95ed5e5bfb412 --- /dev/null +++ b/configs/inception_v3/README.md @@ -0,0 +1,76 @@ +# Inception V3 + +> [Rethinking the Inception Architecture for Computer Vision](http://arxiv.org/abs/1512.00567) + + + +## Abstract + +Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. 
We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error on the validation set (3.6% error on the test set) and 17.3% top-1 error on the validation set. + +
+ +
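+A concrete example of the factorization mentioned above: a 5x5 convolution can be replaced by two stacked 3x3 convolutions, and a 7x7 convolution by a 1x7 followed by a 7x1 convolution, keeping the receptive field and output shape while cutting parameters. The comparison below is a generic PyTorch illustration with arbitrary channel counts, not code from the `InceptionV3` backbone in this repo.
+
+```python
+import torch
+import torch.nn as nn
+
+channels = 192
+
+conv5x5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
+
+# a 5x5 receptive field from two stacked 3x3 convolutions ...
+factorized_3x3 = nn.Sequential(
+    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
+    nn.Conv2d(channels, channels, kernel_size=3, padding=1))
+
+# ... and a 7x7 convolution factorized into asymmetric 1x7 and 7x1 convolutions
+factorized_asym = nn.Sequential(
+    nn.Conv2d(channels, channels, kernel_size=(1, 7), padding=(0, 3)),
+    nn.Conv2d(channels, channels, kernel_size=(7, 1), padding=(3, 0)))
+
+x = torch.rand(1, channels, 17, 17)
+for name, m in [('5x5', conv5x5), ('3x3+3x3', factorized_3x3), ('1x7+7x1', factorized_asym)]:
+    n_params = sum(p.numel() for p in m.parameters())
+    print(name, tuple(m(x).shape), n_params)
+# all three keep the 17x17 output; the factorized variants use fewer parameters
+```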
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('inception-v3_3rdparty_8xb32_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('inception-v3_3rdparty_8xb32_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/inception_v3/inception-v3_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/inception-v3/inception-v3_3rdparty_8xb32_in1k_20220615-dcd4d910.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :-----------------------------------------------------------------------------: | +| `inception-v3_3rdparty_8xb32_in1k`\* | From scratch | 23.83 | 5.75 | 77.57 | 93.58 | [config](inception-v3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/inception-v3/inception-v3_3rdparty_8xb32_in1k_20220615-dcd4d910.pth) | + +*Models with * are converted from the [official repo](https://github.com/pytorch/vision/blob/main/torchvision/models/inception.py#L28). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@inproceedings{szegedy2016rethinking, + title={Rethinking the inception architecture for computer vision}, + author={Szegedy, Christian and Vanhoucke, Vincent and Ioffe, Sergey and Shlens, Jon and Wojna, Zbigniew}, + booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition}, + pages={2818--2826}, + year={2016} +} +``` diff --git a/configs/inception_v3/inception-v3_8xb32_in1k.py b/configs/inception_v3/inception-v3_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ac977f4edbeca55afc3de118162b95cf47f7c15e --- /dev/null +++ b/configs/inception_v3/inception-v3_8xb32_in1k.py @@ -0,0 +1,24 @@ +_base_ = [ + '../_base_/models/inception_v3.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py', +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=299), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=342, edge='short'), + dict(type='CenterCrop', crop_size=299), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/inception_v3/metafile.yml b/configs/inception_v3/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..0b556deccf0d4ed4bc096d59338da061190ae62f --- /dev/null +++ b/configs/inception_v3/metafile.yml @@ -0,0 +1,37 @@ +Collections: + - Name: Inception V3 + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - SGD with Momentum + - Weight Decay + Training Resources: 8x V100 GPUs + Epochs: 100 + Batch Size: 256 + Architecture: + - Inception + Paper: + URL: http://arxiv.org/abs/1512.00567 + Title: "Rethinking the Inception Architecture for Computer Vision" + README: configs/inception_v3/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc1/configs/inception_v3/metafile.yml + Version: v1.0.0rc1 + +Models: + - Name: inception-v3_3rdparty_8xb32_in1k + Metadata: + FLOPs: 5745177632 + Parameters: 23834568 + In Collection: Inception V3 + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.57 + Top 5 Accuracy: 93.58 + Weights: https://download.openmmlab.com/mmclassification/v0/inception-v3/inception-v3_3rdparty_8xb32_in1k_20220615-dcd4d910.pth + Config: configs/inception_v3/inception-v3_8xb32_in1k.py + Converted From: + Weights: https://download.pytorch.org/models/inception_v3_google-0cc3c7bd.pth + Code: https://github.com/pytorch/vision/blob/main/torchvision/models/inception.py#L28 diff --git a/configs/itpn/README.md b/configs/itpn/README.md new file mode 100644 index 0000000000000000000000000000000000000000..93200d0224b64158961f68f9c0fcea0e4fb1da59 --- /dev/null +++ b/configs/itpn/README.md @@ -0,0 +1,65 @@ +# iTPN + +> [Integrally Pre-Trained Transformer Pyramid Networks](https://arxiv.org/abs/2211.12735) + + + +## Abstract + +In this paper, we present an integral pre-training framework based on masked image modeling (MIM). We advocate for pre-training the backbone and neck jointly so that the transfer gap between MIM and downstream recognition tasks is minimal. We make two technical contributions. 
First, we unify the reconstruction and recognition necks by inserting a feature pyramid into the pre-training stage. Second, we complement masked image modeling (MIM) with masked feature modeling (MFM) that offers multi-stage supervision to the feature pyramid. The pre-trained models, termed integrally pre-trained transformer pyramid networks (iTPNs), serve as powerful foundation models for visual recognition. In particular, the base/large-level iTPN achieves an 86.2%/87.8% top-1 accuracy on ImageNet-1K, a 53.2%/55.6% box AP on COCO object detection with a 1x training schedule using Mask R-CNN, and a 54.7%/57.7% mIoU on ADE20K semantic segmentation using UPerHead -- all these results set new records. Our work inspires the community to work on unifying upstream pre-training and downstream fine-tuning tasks. Code and the pre-trained models will be released at https://github.com/sunsmarterjie/iTPN. + +
+ +
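+In other words, the masked-feature-modeling objective applies a reconstruction loss at every level of the feature pyramid, restricted to the masked token positions, against targets taken from pixels or a frozen teacher such as CLIP. The helper below only illustrates that bookkeeping with a cosine loss and made-up shapes; the actual objective and heads live in the `iTPN` model and the configs below.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def masked_feature_loss(pyramid_feats, target_feats, mask):
+    """Per-stage loss averaged over pyramid levels, on masked tokens only.
+
+    pyramid_feats / target_feats: lists of (B, N, C_i) token features,
+    one entry per pyramid stage; mask: (B, N) bool, True = masked.
+    """
+    losses = []
+    for pred, target in zip(pyramid_feats, target_feats):
+        cos = F.cosine_similarity(pred, target, dim=-1)  # (B, N)
+        losses.append((1 - cos)[mask].mean())            # masked positions only
+    return sum(losses) / len(losses)
+
+
+mask = torch.rand(2, 196) < 0.75  # mask ~75% of the 14x14 patch tokens
+preds = [torch.rand(2, 196, c) for c in (128, 256, 512)]
+targets = [torch.rand(2, 196, c) for c in (128, 256, 512)]
+print(masked_feature_loss(preds, targets, mask))
+```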
+ +## How to use it? + + + + + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :------------------------------------------------------ | :--------: | :-------: | :----------------------------------------------------------------: | :------: | +| `itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k` | 233.00 | 18.47 | [config](itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py) | N/A | +| `itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k` | 103.00 | 18.47 | [config](itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py) | N/A | +| `itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k` | 314.00 | 63.98 | [config](itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py) | N/A | + +## Citation + +```bibtex +@article{tian2022integrally, + title={Integrally Pre-Trained Transformer Pyramid Networks}, + author={Tian, Yunjie and Xie, Lingxi and Wang, Zhaozhi and Wei, Longhui and Zhang, Xiaopeng and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang}, + journal={arXiv preprint arXiv:2211.12735}, + year={2022} +} +``` diff --git a/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..40f35d9486e7b532dfd4904d94d379167222b62f --- /dev/null +++ b/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-300e_in1k.py @@ -0,0 +1,84 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs256_itpn.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='iTPN', + backbone=dict( + type='iTPNHiViT', + arch='base', + drop_path_rate=0.0, + rpe=True, + layer_scale_init_value=0.1, + reconstruction_type='clip'), + neck=dict( + type='iTPNPretrainDecoder', + patch_size=16, + in_chans=3, + embed_dim=512, + mlp_ratio=4., + reconstruction_type='clip', + # transformer pyramid + fpn_dim=256, + fpn_depth=2, + num_outs=3, + ), + head=dict( + type='iTPNClipHead', + embed_dims=512, + num_embed=512, + loss=dict(type='CosineSimilarityLoss')), + target_generator=dict( + type='CLIPGenerator', + tokenizer_path= # noqa + 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa + ), +) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + # betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs. 
+ optimizer=dict( + type='AdamW', lr=1.5e-3, betas=(0.9, 0.98), weight_decay=0.05), + clip_grad=dict(max_norm=3.0), + paramwise_cfg=dict( + custom_keys={ + '.norm': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0), + '.gamma': dict(decay_mult=0.0), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + eta_min=1e-5, + by_epoch=True, + begin=10, + end=300, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py b/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2c624e7302924ea544ff2e347966956c4652e4f5 --- /dev/null +++ b/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py @@ -0,0 +1,84 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs256_itpn.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='iTPN', + backbone=dict( + type='iTPNHiViT', + arch='base', + drop_path_rate=0.1, + rpe=True, + layer_scale_init_value=0.1, + reconstruction_type='clip'), + neck=dict( + type='iTPNPretrainDecoder', + patch_size=16, + in_chans=3, + embed_dim=512, + mlp_ratio=4., + reconstruction_type='clip', + # transformer pyramid + fpn_dim=256, + fpn_depth=2, + num_outs=3, + ), + head=dict( + type='iTPNClipHead', + embed_dims=512, + num_embed=512, + loss=dict(type='CrossEntropyLoss')), + target_generator=dict( + type='CLIPGenerator', + tokenizer_path= # noqa + 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa + ), +) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + # betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 800/1600 epochs. + optimizer=dict( + type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05), + clip_grad=dict(max_norm=3.0), + paramwise_cfg=dict( + custom_keys={ + '.norm': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0), + '.gamma': dict(decay_mult=0.0), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + eta_min=1e-5, + by_epoch=True, + begin=10, + end=800, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d324a448fae9edd36fdcfa48c65829fa24a1be51 --- /dev/null +++ b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py @@ -0,0 +1,56 @@ +_base_ = [ + '../_base_/models/itpn_hivit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'norm': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=1560, + by_epoch=True, + begin=40, + end=1600, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..c489dda9321774829fd5bf6e56de65603e177c6a --- /dev/null +++ b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py @@ -0,0 +1,56 @@ +_base_ = [ + '../_base_/models/itpn_hivit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'norm': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=360, + by_epoch=True, + begin=40, + end=400, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ebc5be011a816d23fb0d6ce801d43fd8f4019ae7 --- /dev/null +++ b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py @@ -0,0 +1,56 @@ +_base_ = [ + '../_base_/models/itpn_hivit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'norm': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=760, + by_epoch=True, + begin=40, + end=800, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..359191bc84599016e33b7228a136a06db832b9ea --- /dev/null +++ b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py @@ -0,0 +1,61 @@ +_base_ = [ + '../_base_/models/itpn_hivit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(type='iTPNHiViT', arch='large'), + neck=dict(type='iTPNPretrainDecoder', embed_dim=768)) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=1560, + by_epoch=True, + begin=40, + end=1600, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ca4ba00b23789e1b31e57bb6d1078498a9375f7a --- /dev/null +++ b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py @@ -0,0 +1,61 @@ +_base_ = [ + '../_base_/models/itpn_hivit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(type='iTPNHiViT', arch='large'), + neck=dict(type='iTPNPretrainDecoder', embed_dim=768)) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=360, + by_epoch=True, + begin=40, + end=400, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b1e298b0b97db3c4391dcda5adac4e01438fdfc9 --- /dev/null +++ b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py @@ -0,0 +1,61 @@ +_base_ = [ + '../_base_/models/itpn_hivit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(type='iTPNHiViT', arch='large'), + neck=dict(type='iTPNPretrainDecoder', embed_dim=768)) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=760, + by_epoch=True, + begin=40, + end=800, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/itpn/metafile.yml b/configs/itpn/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..b8f5844de10f3df4114ba9eb655ed5baf844cb0e --- /dev/null +++ b/configs/itpn/metafile.yml @@ -0,0 +1,50 @@ +Collections: + - Name: iTPN + Metadata: + Architecture: + - Dense Connections + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + Paper: + Title: 'Integrally Pre-Trained Transformer Pyramid Networks' + URL: https://arxiv.org/abs/2211.12735 + README: configs/itpn/README.md + Code: + URL: null + Version: null + +Models: + - Name: itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k + Metadata: + FLOPs: 18474000000 + Parameters: 233000000 + Training Data: + - ImageNet-1k + In Collection: iTPN + Results: null + Weights: + Config: configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py + + - Name: itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k + Metadata: + FLOPs: 18474000000 + Parameters: 103000000 + Training Data: + - ImageNet-1k + In Collection: iTPN + Results: null + Weights: + Config: configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py + + - Name: itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k + Metadata: + FLOPs: 63977000000 + Parameters: 314000000 + Training Data: + - ImageNet-1k + In Collection: iTPN + Results: null + Weights: + Config: configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py diff --git a/configs/lenet/README.md b/configs/lenet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2cd68eac42ed7fa1d0167fe1f7b9ad917e5ce735 --- /dev/null +++ b/configs/lenet/README.md @@ -0,0 +1,28 @@ +# LeNet + +> [Backpropagation Applied to Handwritten Zip Code Recognition](https://ieeexplore.ieee.org/document/6795724) + + + +## Abstract + +The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification. + +
+ +
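+Concretely, the constrained architecture amounts to a few weight-shared convolution layers with subsampling, followed by fully connected layers. A present-day PyTorch sketch of a LeNet-5-style network is given below for orientation only; the `lenet5_mnist.py` config added below uses the repo's own `LeNet5` backbone instead.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class TinyLeNet(nn.Module):
+    """LeNet-5-style network for 32x32 single-channel digit images."""
+
+    def __init__(self, num_classes=10):
+        super().__init__()
+        self.features = nn.Sequential(
+            nn.Conv2d(1, 6, 5), nn.Tanh(), nn.AvgPool2d(2),    # 32 -> 28 -> 14
+            nn.Conv2d(6, 16, 5), nn.Tanh(), nn.AvgPool2d(2),   # 14 -> 10 -> 5
+            nn.Conv2d(16, 120, 5), nn.Tanh())                  # 5 -> 1
+        self.classifier = nn.Sequential(
+            nn.Flatten(), nn.Linear(120, 84), nn.Tanh(), nn.Linear(84, num_classes))
+
+    def forward(self, x):
+        return self.classifier(self.features(x))
+
+
+print(TinyLeNet()(torch.rand(1, 1, 32, 32)).shape)  # (1, 10)
+```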
+ +## Citation + +``` +@ARTICLE{6795724, + author={Y. {LeCun} and B. {Boser} and J. S. {Denker} and D. {Henderson} and R. E. {Howard} and W. {Hubbard} and L. D. {Jackel}}, + journal={Neural Computation}, + title={Backpropagation Applied to Handwritten Zip Code Recognition}, + year={1989}, + volume={1}, + number={4}, + pages={541-551}, + doi={10.1162/neco.1989.1.4.541}} +} +``` diff --git a/configs/lenet/lenet5_mnist.py b/configs/lenet/lenet5_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..0ae8192548626c0073228a827d6b6b6595730a5e --- /dev/null +++ b/configs/lenet/lenet5_mnist.py @@ -0,0 +1,89 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict(type='LeNet5', num_classes=10), + neck=None, + head=dict( + type='ClsHead', + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# dataset settings +dataset_type = 'MNIST' +data_preprocessor = dict(mean=[33.46], std=[78.87], num_classes=10) + +pipeline = [dict(type='Resize', scale=32), dict(type='PackInputs')] + +common_data_cfg = dict( + type=dataset_type, data_prefix='data/mnist', pipeline=pipeline) + +train_dataloader = dict( + batch_size=128, + num_workers=2, + dataset=dict(**common_data_cfg, test_mode=False), + sampler=dict(type='DefaultSampler', shuffle=True), +) + +val_dataloader = dict( + batch_size=128, + num_workers=2, + dataset=dict(**common_data_cfg, test_mode=True), + sampler=dict(type='DefaultSampler', shuffle=False), +) +val_evaluator = dict(type='Accuracy', topk=(1, )) + +test_dataloader = val_dataloader +test_evaluator = val_evaluator + +# schedule settings +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)) + +param_scheduler = dict( + type='MultiStepLR', # learning policy, decay on several milestones. + by_epoch=True, # update based on epoch. + milestones=[15], # decay at the 15th epochs. + gamma=0.1, # decay to 0.1 times. +) + +train_cfg = dict(by_epoch=True, max_epochs=5, val_interval=1) # train 5 epochs +val_cfg = dict() +test_cfg = dict() + +# runtime settings +default_scope = 'mmpretrain' + +default_hooks = dict( + # record the time of every iteration. + timer=dict(type='IterTimerHook'), + # print log every 150 iterations. + logger=dict(type='LoggerHook', interval=150), + # enable the parameter scheduler. + param_scheduler=dict(type='ParamSchedulerHook'), + # save checkpoint per epoch. + checkpoint=dict(type='CheckpointHook', interval=1), + # set sampler seed in distributed evrionment. + sampler_seed=dict(type='DistSamplerSeedHook'), +) + +env_cfg = dict( + # disable cudnn benchmark + cudnn_benchmark=False, + # set multi process parameters + mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0), + # set distributed parameters + dist_cfg=dict(backend='nccl'), +) + +log_level = 'INFO' + +# load from which checkpoint +load_from = None + +# whether to resume the training of the checkpoint +resume_from = None + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (1 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=128) diff --git a/configs/levit/README.md b/configs/levit/README.md new file mode 100644 index 0000000000000000000000000000000000000000..234edb60618b3edd61cb01c0c172513011b1b042 --- /dev/null +++ b/configs/levit/README.md @@ -0,0 +1,81 @@ +# LeViT + +> [LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) + + + +## Abstract + +We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeVIT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. + +
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('levit-128s_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('levit-128s_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/levit/levit-128s_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/levit/levit-128s_3rdparty_in1k_20230117-e9fbd209.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :--------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :--------------------------------------------------------------------------------------: | +| `levit-128s_3rdparty_in1k`\* | From scratch | 7.39 | 0.31 | 76.51 | 92.90 | [config](levit-128s_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-128s_3rdparty_in1k_20230117-e9fbd209.pth) | +| `levit-128_3rdparty_in1k`\* | From scratch | 8.83 | 0.41 | 78.58 | 93.95 | [config](levit-128_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-128_3rdparty_in1k_20230117-3be02a02.pth) | +| `levit-192_3rdparty_in1k`\* | From scratch | 10.56 | 0.67 | 79.86 | 94.75 | [config](levit-192_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-192_3rdparty_in1k_20230117-8217a0f9.pth) | +| `levit-256_3rdparty_in1k`\* | From scratch | 18.38 | 1.14 | 81.59 | 95.46 | [config](levit-256_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-256_3rdparty_in1k_20230117-5ae2ce7d.pth) | +| `levit-384_3rdparty_in1k`\* | From scratch | 38.36 | 2.37 | 82.59 | 95.95 | [config](levit-384_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-384_3rdparty_in1k_20230117-f3539cce.pth) | + +*Models with * are converted from the [official repo](https://github.com/facebookresearch/LeViT). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@InProceedings{Graham_2021_ICCV, + author = {Graham, Benjamin and El-Nouby, Alaaeldin and Touvron, Hugo and Stock, Pierre and Joulin, Armand and Jegou, Herve and Douze, Matthijs}, + title = {LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference}, + booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, + month = {October}, + year = {2021}, + pages = {12259-12269} +} +``` diff --git a/configs/levit/deploy/levit-128_8xb256_in1k.py b/configs/levit/deploy/levit-128_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ab58119395339cc11a4cb09caad1ea0cb6c7ae3b --- /dev/null +++ b/configs/levit/deploy/levit-128_8xb256_in1k.py @@ -0,0 +1,3 @@ +_base_ = '../levit-128_8xb256_in1k.py' + +model = dict(backbone=dict(deploy=True), head=dict(deploy=True)) diff --git a/configs/levit/deploy/levit-128s_8xb256_in1k.py b/configs/levit/deploy/levit-128s_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..93ebc3724714b73b362bc12de1b9029040cbc4f6 --- /dev/null +++ b/configs/levit/deploy/levit-128s_8xb256_in1k.py @@ -0,0 +1,3 @@ +_base_ = '../levit-128s_8xb256_in1k.py' + +model = dict(backbone=dict(deploy=True), head=dict(deploy=True)) diff --git a/configs/levit/deploy/levit-192_8xb256_in1k.py b/configs/levit/deploy/levit-192_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..34249fda74d97b4f1e591cd39722b9cbdd94d3d2 --- /dev/null +++ b/configs/levit/deploy/levit-192_8xb256_in1k.py @@ -0,0 +1,3 @@ +_base_ = '../levit-192_8xb256_in1k.py' + +model = dict(backbone=dict(deploy=True), head=dict(deploy=True)) diff --git a/configs/levit/deploy/levit-256_8xb256_in1k.py b/configs/levit/deploy/levit-256_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..687f83506e30fcf36041729b70b30822b30cae81 --- /dev/null +++ b/configs/levit/deploy/levit-256_8xb256_in1k.py @@ -0,0 +1,3 @@ +_base_ = '../levit-256_8xb256_in1k.py' + +model = dict(backbone=dict(deploy=True), head=dict(deploy=True)) diff --git a/configs/levit/deploy/levit-384_8xb256_in1k.py b/configs/levit/deploy/levit-384_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9a83d47a54507022389bfb34c50ae466c978586b --- /dev/null +++ b/configs/levit/deploy/levit-384_8xb256_in1k.py @@ -0,0 +1,3 @@ +_base_ = '../levit-384_8xb256_in1k.py' + +model = dict(backbone=dict(deploy=True), head=dict(deploy=True)) diff --git a/configs/levit/levit-128_8xb256_in1k.py b/configs/levit/levit-128_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..cdec48e3ffbb317ae464be244bf8e05cf4c41165 --- /dev/null +++ b/configs/levit/levit-128_8xb256_in1k.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/levit-256-p16.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs2048_adamw_levit.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict(backbone=dict(arch='128'), head=dict(in_channels=384)) + +# dataset settings +train_dataloader = dict(batch_size=256) diff --git a/configs/levit/levit-128s_8xb256_in1k.py b/configs/levit/levit-128s_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0564cac7e018ec4e311f5e970e9211260ada402c --- /dev/null +++ b/configs/levit/levit-128s_8xb256_in1k.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/levit-256-p16.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + 
'../_base_/schedules/imagenet_bs2048_adamw_levit.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict(backbone=dict(arch='128s'), head=dict(in_channels=384)) + +# dataset settings +train_dataloader = dict(batch_size=256) diff --git a/configs/levit/levit-192_8xb256_in1k.py b/configs/levit/levit-192_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..dfbf70e0ad2f0a35e4acca090bd6d2cadd6932f0 --- /dev/null +++ b/configs/levit/levit-192_8xb256_in1k.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/levit-256-p16.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs2048_adamw_levit.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict(backbone=dict(arch='192'), head=dict(in_channels=384)) + +# dataset settings +train_dataloader = dict(batch_size=256) diff --git a/configs/levit/levit-256_8xb256_in1k.py b/configs/levit/levit-256_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e961e776faf923f7acceef8b2578f86e7f630afa --- /dev/null +++ b/configs/levit/levit-256_8xb256_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/levit-256-p16.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs2048_adamw_levit.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_dataloader = dict(batch_size=256) diff --git a/configs/levit/levit-384_8xb256_in1k.py b/configs/levit/levit-384_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..10ceac45c4cc907165d75c6b1b320c07f9a384e9 --- /dev/null +++ b/configs/levit/levit-384_8xb256_in1k.py @@ -0,0 +1,15 @@ +_base_ = [ + '../_base_/models/levit-256-p16.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs2048_adamw_levit.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(arch='384', drop_path_rate=0.1), + head=dict(in_channels=768), +) + +# dataset settings +train_dataloader = dict(batch_size=256) diff --git a/configs/levit/metafile.yml b/configs/levit/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..78b62c5c12dcee63d1790b597f0222d7f8324361 --- /dev/null +++ b/configs/levit/metafile.yml @@ -0,0 +1,101 @@ +Collections: + - Name: LeViT + Metadata: + Training Data: ImageNet-1k + Architecture: + - 1x1 Convolution + - LeViT Attention Block + Paper: + Title: "LeViT: a Vision Transformer in ConvNet\u2019s Clothing for Faster Inference" + URL: https://arxiv.org/abs/2104.01136 + README: configs/levit/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/levit.py + Version: v1.0.0rc5 + +Models: + - Name: levit-128s_3rdparty_in1k + Metadata: + FLOPs: 310342496 + Parameters: 7391290 + Training Data: ImageNet-1k + In Collection: LeViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 76.51 + Top 5 Accuracy: 92.90 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-128s_3rdparty_in1k_20230117-e9fbd209.pth + Config: configs/levit/levit-128s_8xb256_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-128S-96703c44.pth + Code: https://github.com/facebookresearch/LeViT + - Name: levit-128_3rdparty_in1k + Metadata: + FLOPs: 413060992 + Parameters: 8828168 + Training Data: ImageNet-1k + In Collection: LeViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.58 + Top 5 Accuracy: 93.95 + Task: 
Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-128_3rdparty_in1k_20230117-3be02a02.pth + Config: configs/levit/levit-128_8xb256_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-128-b88c2750.pth + Code: https://github.com/facebookresearch/LeViT + - Name: levit-192_3rdparty_in1k + Metadata: + FLOPs: 667860704 + Parameters: 10561301 + Training Data: ImageNet-1k + In Collection: LeViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.86 + Top 5 Accuracy: 94.75 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-192_3rdparty_in1k_20230117-8217a0f9.pth + Config: configs/levit/levit-192_8xb256_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-192-92712e41.pth + Code: https://github.com/facebookresearch/LeViT + - Name: levit-256_3rdparty_in1k + Metadata: + FLOPs: 1141625216 + Parameters: 18379852 + Training Data: ImageNet-1k + In Collection: LeViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.59 + Top 5 Accuracy: 95.46 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-256_3rdparty_in1k_20230117-5ae2ce7d.pth + Config: configs/levit/levit-256_8xb256_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-256-13b5763e.pth + Code: https://github.com/facebookresearch/LeViT + - Name: levit-384_3rdparty_in1k + Metadata: + FLOPs: 2372941568 + Parameters: 38358300 + Training Data: ImageNet-1k + In Collection: LeViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.59 + Top 5 Accuracy: 95.95 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-384_3rdparty_in1k_20230117-f3539cce.pth + Config: configs/levit/levit-384_8xb256_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-384-9bdaf2e2.pth + Code: https://github.com/facebookresearch/LeViT diff --git a/configs/llava/README.md b/configs/llava/README.md new file mode 100644 index 0000000000000000000000000000000000000000..581abfe5a66c30ce9ff1062d2fe605e17bb2f501 --- /dev/null +++ b/configs/llava/README.md @@ -0,0 +1,51 @@ +# LLaVA + +> [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) + + + +## Abstract + +Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.Our early experiments show that LLaVA demonstrates impressive multimodel chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available. + +
+ +
+ +## How to use it? + + + +**Use the model** + +```python +import torch +from mmpretrain import get_model, inference_model + +out = inference_model('llava-7b-v1_caption', 'demo/cat-dog.png', device='cuda') +print(out) +# {'pred_caption': 'In the image, there are two cats sitting on a blanket.'} +``` + + + +## Models and results + +### Image Caption on COCO + +| Model | Params (M) | Config | Download | +| :---------------------- | :--------: | :--------------------------------: | :-------------------------------------------------------------------------------------------------------------: | +| `llava-7b-v1_caption` | 7045.82 | [config](llava-7b-v1_caption.py) | [ckpt](https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1_liuhaotian_20231025-c9e119b6.pth) | +| `llava-7b-v1.5_caption` | 7062.90 | [config](llava-7b-v1.5_caption.py) | [ckpt](https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1.5_liuhaotian_20231025-5828aa5a.pth) | +| `llava-7b-v1.5_vqa` | 7062.90 | [config](llava-7b-v1.5_vqa.py) | [ckpt](https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1.5_liuhaotian_20231025-5828aa5a.pth) | + +## Citation + +```bibtex +@misc{liu2023llava, + title={Visual Instruction Tuning}, + author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae}, + publisher={arXiv:2304.08485}, + year={2023}, +} +``` diff --git a/configs/llava/llava-7b-v1.5_caption.py b/configs/llava/llava-7b-v1.5_caption.py new file mode 100644 index 0000000000000000000000000000000000000000..371c9b5f6174416ade8708b9c74bc7f684f2af8c --- /dev/null +++ b/configs/llava/llava-7b-v1.5_caption.py @@ -0,0 +1,76 @@ +_base_ = '../_base_/default_runtime.py' + +meta_prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions." # noqa: E501 +image_size = 336 +prompt_tmpl = f'''{meta_prompt} User: +Describe the image in detail. 
ASSISTANT:''' + +# model settings +model = dict( + type='Llava', + tokenizer=dict( + type='AutoTokenizer', name_or_path='liuhaotian/llava-v1.5-7b'), + vision_encoder=dict( + type='VisionTransformer', + arch='l', + patch_size=14, + img_size=image_size, + pre_norm=True, + norm_cfg=dict(type='LN', eps=1e-5), + layer_cfgs=dict(act_cfg=dict(type='mmpretrain.QuickGELU')), + final_norm=False, + out_type='raw', + pretrained='https://download.openmmlab.com/mmclassification/v0/clip/' + 'vit-large-p14_clip-openai-pre_336px_20231025-fb1315ed.pth', + ), + mm_hidden_size=1024, + use_im_patch=False, + use_im_start_end=False, + mm_proj_depth=2, + lang_encoder=dict( + type='AutoModelForCausalLM', + name_or_path='huggyllama/llama-7b', + ), + task='caption', + prompt_tmpl=prompt_tmpl, + generation_cfg=dict(num_beams=3, max_new_tokens=50, length_penalty=-1.0), +) + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(image_size, image_size), + interpolation='bicubic', + backend='pillow'), + dict(type='PackInputs', meta_keys=['image_id']), +] + +test_dataloader = dict( + batch_size=8, + num_workers=5, + dataset=dict( + type='COCOCaption', + data_root='data/coco', + ann_file='annotations/coco_karpathy_val.json', + pipeline=test_pipeline, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) + +test_evaluator = dict( + type='COCOCaption', + ann_file='data/coco/annotations/coco_karpathy_val_gt.json', +) + +# schedule settings +test_cfg = dict() diff --git a/configs/llava/llava-7b-v1.5_vqa.py b/configs/llava/llava-7b-v1.5_vqa.py new file mode 100644 index 0000000000000000000000000000000000000000..5cb9812cd98b207c96b44da8261f4a11b4f04691 --- /dev/null +++ b/configs/llava/llava-7b-v1.5_vqa.py @@ -0,0 +1,76 @@ +_base_ = '../_base_/default_runtime.py' + +meta_prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions." 
# noqa: E501 +image_size = 336 +prompt_tmpl = f'''{meta_prompt} User: +{{question}} ASSISTANT:''' + +# model settings +model = dict( + type='Llava', + tokenizer=dict( + type='AutoTokenizer', name_or_path='liuhaotian/llava-v1.5-7b'), + vision_encoder=dict( + type='VisionTransformer', + arch='l', + patch_size=14, + img_size=image_size, + pre_norm=True, + norm_cfg=dict(type='LN', eps=1e-5), + layer_cfgs=dict(act_cfg=dict(type='mmpretrain.QuickGELU')), + final_norm=False, + out_type='raw', + pretrained='https://download.openmmlab.com/mmclassification/v0/clip/' + 'vit-large-p14_clip-openai-pre_336px_20231025-fb1315ed.pth', + ), + mm_hidden_size=1024, + use_im_patch=False, + use_im_start_end=False, + mm_proj_depth=2, + lang_encoder=dict( + type='AutoModelForCausalLM', + name_or_path='huggyllama/llama-7b', + ), + task='vqa', + prompt_tmpl=prompt_tmpl, + generation_cfg=dict(max_new_tokens=100), +) + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(image_size, image_size), + interpolation='bicubic', + backend='pillow'), + dict(type='PackInputs', meta_keys=['image_id', 'question']), +] + +test_dataloader = dict( + batch_size=8, + num_workers=5, + dataset=dict( + type='COCOCaption', + data_root='data/coco', + ann_file='annotations/coco_karpathy_val.json', + pipeline=test_pipeline, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) + +test_evaluator = dict( + type='COCOCaption', + ann_file='data/coco/annotations/coco_karpathy_val_gt.json', +) + +# schedule settings +test_cfg = dict() diff --git a/configs/llava/llava-7b-v1_caption.py b/configs/llava/llava-7b-v1_caption.py new file mode 100644 index 0000000000000000000000000000000000000000..92e2d1fb65aab218a2c285c8d97b9f8886681304 --- /dev/null +++ b/configs/llava/llava-7b-v1_caption.py @@ -0,0 +1,78 @@ +_base_ = '../_base_/default_runtime.py' + +meta_prompt = 'You are LLaVA, a large language and vision assistant trained by UW Madison WAIV Lab.You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.Follow the instructions carefully and explain your answers in detail.' # noqa: E501 +image_size = 224 +prompt_tmpl = f'''{meta_prompt} User: +Describe the image in detail. 
ASSISTANT:''' + +# model settings +model = dict( + type='Llava', + tokenizer=dict( + type='AutoTokenizer', + name_or_path='liuhaotian/LLaVA-Lightning-7B-delta-v1-1'), + vision_encoder=dict( + type='VisionTransformer', + arch='l', + patch_size=14, + img_size=image_size, + pre_norm=True, + norm_cfg=dict(type='LN', eps=1e-5), + layer_cfgs=dict(act_cfg=dict(type='mmpretrain.QuickGELU')), + final_norm=False, + out_type='raw', + pretrained=( + 'https://download.openmmlab.com/mmclassification/v0/clip/' + 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'), + ), + mm_hidden_size=1024, + use_im_patch=False, + use_im_start_end=True, + mm_proj_depth=1, + lang_encoder=dict( + type='AutoModelForCausalLM', + name_or_path='huggyllama/llama-7b', + ), + task='caption', + prompt_tmpl=prompt_tmpl, + generation_cfg=dict(max_new_tokens=50), +) + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(image_size, image_size), + interpolation='bicubic', + backend='pillow'), + dict(type='PackInputs', meta_keys=['image_id']), +] + +test_dataloader = dict( + batch_size=8, + num_workers=5, + dataset=dict( + type='COCOCaption', + data_root='data/coco', + ann_file='annotations/coco_karpathy_val.json', + pipeline=test_pipeline, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) + +test_evaluator = dict( + type='COCOCaption', + ann_file='data/coco/annotations/coco_karpathy_val_gt.json', +) + +# schedule settings +test_cfg = dict() diff --git a/configs/llava/metafile.yml b/configs/llava/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..406a214c33a5d8a3d1e2b73cfebd51975a27071e --- /dev/null +++ b/configs/llava/metafile.yml @@ -0,0 +1,51 @@ +Collections: + - Name: LLaVA + Metadata: + Architecture: + - LLaMA + - CLIP + Paper: + Title: Visual Instruction Tuning + URL: https://arxiv.org/abs/2304.08485 + README: configs/llava/README.md + +Models: + - Name: llava-7b-v1_caption + Metadata: + FLOPs: null + Parameters: 7045816320 + In Collection: LLaVA + Results: + - Task: Image Caption + Dataset: COCO + Metrics: + BLEU-4: null + CIDER: null + Weights: https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1_liuhaotian_20231025-c9e119b6.pth + Config: configs/llava/llava-7b-v1_caption.py + - Name: llava-7b-v1.5_caption + Metadata: + FLOPs: null + Parameters: 7062900736 + In Collection: LLaVA + Results: + - Task: Image Caption + Dataset: COCO + Metrics: + BLEU-4: null + CIDER: null + Weights: https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1.5_liuhaotian_20231025-5828aa5a.pth + Config: configs/llava/llava-7b-v1.5_caption.py + - Name: llava-7b-v1.5_vqa + Metadata: + FLOPs: null + Parameters: 7062900736 + In Collection: LLaVA + Results: + - Task: Visual Question Answering + Dataset: COCO + Metrics: + BLEU-4: null + CIDER: null + Weights: https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1.5_liuhaotian_20231025-5828aa5a.pth + Config: configs/llava/llava-7b-v1.5_vqa.py diff --git a/configs/mae/README.md b/configs/mae/README.md new file mode 100644 index 0000000000000000000000000000000000000000..69f5f9bf35f9aa4bbe3097c58256496445f864dd --- /dev/null +++ b/configs/mae/README.md @@ -0,0 +1,123 @@ +# MAE + +> [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 
+ + + +## Abstract + +This paper shows that masked autoencoders (MAE) are +scalable self-supervised learners for computer vision. Our +MAE approach is simple: we mask random patches of the +input image and reconstruct the missing pixels. It is based +on two core designs. First, we develop an asymmetric +encoder-decoder architecture, with an encoder that operates only on the +visible subset of patches (without mask tokens), along with a lightweight +decoder that reconstructs the original image from the latent representation +and mask tokens. Second, we find that masking a high proportion +of the input image, e.g., 75%, yields a nontrivial and +meaningful self-supervisory task. Coupling these two designs enables us to +train large models efficiently and effectively: we accelerate +training (by 3× or more) and improve accuracy. Our scalable approach allows +for learning high-capacity models that generalize well: e.g., a vanilla +ViT-Huge model achieves the best accuracy (87.8%) among +methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior. + +
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('vit-base-p16_mae-300e-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('mae_vit-base-p16_8xb512-amp-coslr-300e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py None +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :---------------------------------------------- | :--------: | :-------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------: | +| `mae_vit-base-p16_8xb512-amp-coslr-300e_in1k` | 111.91 | 17.58 | [config](mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.json) | +| `mae_vit-base-p16_8xb512-amp-coslr-400e_in1k` | 111.91 | 17.58 | [config](mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.json) | +| `mae_vit-base-p16_8xb512-amp-coslr-800e_in1k` | 111.91 | 17.58 | [config](mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.json) | +| `mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k` | 111.91 | 17.58 | [config](mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.json) | +| `mae_vit-large-p16_8xb512-amp-coslr-400e_in1k` | 329.54 | 61.60 | [config](mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.pth) \| 
[log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.json) | +| `mae_vit-large-p16_8xb512-amp-coslr-800e_in1k` | 329.54 | 61.60 | [config](mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.json) | +| `mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k` | 329.54 | 61.60 | [config](mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.json) | +| `mae_vit-huge-p16_8xb512-amp-coslr-1600e_in1k` | 657.07 | 167.40 | [config](mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `vit-base-p16_mae-300e-pre_8xb128-coslr-100e_in1k` | [MAE 300-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.pth) | 86.57 | 17.58 | 83.10 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | N/A | +| `vit-base-p16_mae-400e-pre_8xb128-coslr-100e_in1k` | [MAE 400-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.pth) | 86.57 | 17.58 | 83.30 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | N/A | +| `vit-base-p16_mae-800e-pre_8xb128-coslr-100e_in1k` | [MAE 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth) | 86.57 | 17.58 | 83.30 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | N/A | +| `vit-base-p16_mae-1600e-pre_8xb128-coslr-100e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth) | 86.57 | 17.58 | 83.50 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.pth) \| 
[log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.json) | +| `vit-base-p16_mae-300e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 300-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.pth) | 86.57 | 17.58 | 60.80 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A | +| `vit-base-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 400-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.pth) | 86.57 | 17.58 | 62.50 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A | +| `vit-base-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth) | 86.57 | 17.58 | 65.10 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A | +| `vit-base-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth) | 86.57 | 17.58 | 67.10 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A | +| `vit-large-p16_mae-400e-pre_8xb128-coslr-50e_in1k` | [MAE 400-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.pth) | 304.32 | 61.60 | 85.20 | [config](benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py) | N/A | +| `vit-large-p16_mae-800e-pre_8xb128-coslr-50e_in1k` | [MAE 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.pth) | 304.32 | 61.60 | 85.40 | [config](benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py) | N/A | +| `vit-large-p16_mae-1600e-pre_8xb128-coslr-50e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.pth) | 304.32 | 61.60 | 85.70 | [config](benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py) | N/A | +| `vit-large-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 400-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.pth) | 304.33 | 61.60 | 70.70 | [config](benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A | +| `vit-large-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.pth) | 304.33 | 61.60 | 73.70 | [config](benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A | +| `vit-large-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.pth) | 304.33 | 61.60 | 75.50 | 
[config](benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A | +| `vit-huge-p14_mae-1600e-pre_8xb128-coslr-50e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.pth) | 632.04 | 167.40 | 86.90 | [config](benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k_20220916-0bfc9bfd.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k_20220916-0bfc9bfd.json) | +| `vit-huge-p14_mae-1600e-pre_32xb8-coslr-50e_in1k-448px` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.pth) | 633.03 | 732.13 | 87.30 | [config](benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448_20220916-95b6a0ce.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448_20220916-95b6a0ce.json) | + +## Citation + +```bibtex +@article{He2021MaskedAA, + title={Masked Autoencoders Are Scalable Vision Learners}, + author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and + Piotr Doll'ar and Ross B. 
Girshick}, + journal={arXiv}, + year={2021} +} +``` diff --git a/configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py b/configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4cf9ca1134766cd3b0179b7581511cd94dedbbc2 --- /dev/null +++ b/configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py @@ -0,0 +1,114 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=0.3333333333333333, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs') +] +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + drop_path_rate=0.1, + out_type='avg_featmap', + final_norm=False, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +# optimizer wrapper +optim_wrapper = dict( + optimizer=dict( + type='AdamW', lr=2e-3, weight_decay=0.05, betas=(0.9, 0.999)), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.65, + custom_keys={ + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=95, + by_epoch=True, + begin=5, + end=100, + eta_min=1e-6, + convert_to_iter_based=True) +] + +# runtime settings +default_hooks = dict( + # save checkpoint per epoch. 
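+    # the hook below keeps only the three most recent epoch checkpoints
+    # (max_keep_ckpts=3), bounding disk usage over the 100-epoch fine-tune.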
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +train_cfg = dict(by_epoch=True, max_epochs=100) + +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py b/configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b0545c99d002925886349c7979ab0722fbf8f37a --- /dev/null +++ b/configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py @@ -0,0 +1,64 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +# dataset settings +train_dataloader = dict(batch_size=2048, drop_last=True) +val_dataloader = dict(drop_last=False) +test_dataloader = dict(drop_last=False) + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + frozen_stages=12, + out_type='cls_token', + final_norm=True, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=dict(type='ClsBatchNormNeck', input_features=768), + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)])) + +# optimizer +optim_wrapper = dict( + _delete_=True, + type='AmpOptimWrapper', + optimizer=dict(type='LARS', lr=6.4, weight_decay=0.0, momentum=0.9)) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=80, + by_epoch=True, + begin=10, + end=90, + eta_min=0.0, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=90) + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3), + logger=dict(type='LoggerHook', interval=10)) + +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/mae/benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py b/configs/mae/benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py new file mode 100644 index 0000000000000000000000000000000000000000..60046b48d49f2bcc74a672c7b615da3062ad829b --- /dev/null +++ b/configs/mae/benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py @@ -0,0 +1,116 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=448, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=0.3333333333333333, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs') +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=512, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=448), + dict(type='PackInputs') +] + 
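+# Evaluation resizes the short edge to 512 and center-crops to 448 px, matching
+# the 448 px training crop above and keeping roughly the same resize-to-crop
+# ratio as the 256 -> 224 setting used in the 224 px configs.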
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='huge', + img_size=448, + patch_size=14, + drop_path_rate=0.3, # set to 0.3 + out_type='avg_featmap', + final_norm=False, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +# optimizer wrapper +# learning rate and layer decay rate are set to 0.004 and 0.75 respectively +optim_wrapper = dict( + optimizer=dict( + type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.75, + custom_keys={ + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=45, + by_epoch=True, + begin=5, + end=50, + eta_min=1e-6, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=50) +default_hooks = dict( + # save checkpoint per epoch. + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/mae/benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2a9ff51890be80c6070058b2dd3e837027864da5 --- /dev/null +++ b/configs/mae/benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py @@ -0,0 +1,115 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=0.3333333333333333, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs') +] +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + 
type='VisionTransformer', + arch='huge', + img_size=224, + patch_size=14, + drop_path_rate=0.3, # set to 0.3 + out_type='avg_featmap', + final_norm=False, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1280, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +# optimizer wrapper +# learning rate and layer decay rate are set to 0.004 and 0.75 respectively +optim_wrapper = dict( + optimizer=dict( + type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.75, + custom_keys={ + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=45, + by_epoch=True, + begin=5, + end=50, + eta_min=1e-6, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=50) +default_hooks = dict( + # save checkpoint per epoch. + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/mae/benchmarks/vit-huge-p14_8xb128-ds-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-huge-p14_8xb128-ds-coslr-50e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..813f7c03f300e1579b2ca036995b1a78135f2293 --- /dev/null +++ b/configs/mae/benchmarks/vit-huge-p14_8xb128-ds-coslr-50e_in1k.py @@ -0,0 +1,31 @@ +_base_ = ['./vit-huge-p14_8xb128-coslr-50e_in1k.py'] + +# optimizer wrapper +optim_wrapper = dict(type='DeepSpeedOptimWrapper') + +# training strategy +strategy = dict( + type='DeepSpeedStrategy', + fp16=dict( + enabled=True, + fp16_master_weights_and_grads=False, + loss_scale=0, + loss_scale_window=500, + hysteresis=2, + min_loss_scale=1, + initial_scale_power=15, + ), + inputs_to_half=['inputs'], + zero_optimization=dict( + stage=1, + allgather_partitions=True, + reduce_scatter=True, + allgather_bucket_size=50000000, + reduce_bucket_size=50000000, + overlap_comm=True, + contiguous_gradients=True, + cpu_offload=False, + )) + +# runner which supports strategies +runner_type = 'FlexibleRunner' diff --git a/configs/mae/benchmarks/vit-huge-p14_8xb128-fsdp-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-huge-p14_8xb128-fsdp-coslr-50e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..5f8dfb760f3e0282a5efce7bd9322ca381a802c2 --- /dev/null +++ b/configs/mae/benchmarks/vit-huge-p14_8xb128-fsdp-coslr-50e_in1k.py @@ -0,0 +1,13 @@ +_base_ = ['./vit-huge-p14_8xb128-coslr-50e_in1k.py'] + +strategy = dict( + type='FSDPStrategy', + model_wrapper=dict( + auto_wrap_policy=dict( + type='torch.distributed.fsdp.wrap.size_based_auto_wrap_policy', + min_num_params=1e7))) + +optim_wrapper = dict(type='AmpOptimWrapper') + +# runner which supports strategies +runner_type = 'FlexibleRunner' diff --git a/configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py new file mode 100644 index 
0000000000000000000000000000000000000000..ae86b40b8a262bc9f33e523afd161fdb014971bd --- /dev/null +++ b/configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py @@ -0,0 +1,115 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=0.3333333333333333, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs') +] +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='large', + img_size=224, + patch_size=16, + drop_path_rate=0.2, # set to 0.2 + out_type='avg_featmap', + final_norm=False, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=1024, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +# optimizer wrapper +# learning rate and layer decay rate are set to 0.004 and 0.75 respectively +optim_wrapper = dict( + optimizer=dict( + type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.75, + custom_keys={ + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=45, + by_epoch=True, + begin=5, + end=50, + eta_min=1e-6, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=50) +default_hooks = dict( + # save checkpoint per epoch. 
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/mae/benchmarks/vit-large-p16_8xb128-ds-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-large-p16_8xb128-ds-coslr-50e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9aedb431c5521f725912983444523f25340eac2a --- /dev/null +++ b/configs/mae/benchmarks/vit-large-p16_8xb128-ds-coslr-50e_in1k.py @@ -0,0 +1,31 @@ +_base_ = ['./vit-large-p16_8xb128-coslr-50e_in1k.py'] + +# optimizer wrapper +optim_wrapper = dict(type='DeepSpeedOptimWrapper') + +# training strategy +strategy = dict( + type='DeepSpeedStrategy', + fp16=dict( + enabled=True, + fp16_master_weights_and_grads=False, + loss_scale=0, + loss_scale_window=500, + hysteresis=2, + min_loss_scale=1, + initial_scale_power=15, + ), + inputs_to_half=['inputs'], + zero_optimization=dict( + stage=1, + allgather_partitions=True, + reduce_scatter=True, + allgather_bucket_size=50000000, + reduce_bucket_size=50000000, + overlap_comm=True, + contiguous_gradients=True, + cpu_offload=False, + )) + +# runner which supports strategies +runner_type = 'FlexibleRunner' diff --git a/configs/mae/benchmarks/vit-large-p16_8xb128-fsdp-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-large-p16_8xb128-fsdp-coslr-50e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..3a8a67401eb3bb7204521d6ff97603eebc7e00c9 --- /dev/null +++ b/configs/mae/benchmarks/vit-large-p16_8xb128-fsdp-coslr-50e_in1k.py @@ -0,0 +1,13 @@ +_base_ = ['./vit-large-p16_8xb128-coslr-50e_in1k.py'] + +strategy = dict( + type='FSDPStrategy', + model_wrapper=dict( + auto_wrap_policy=dict( + type='torch.distributed.fsdp.wrap.size_based_auto_wrap_policy', + min_num_params=1e7))) + +optim_wrapper = dict(type='AmpOptimWrapper') + +# runner which supports strategies +runner_type = 'FlexibleRunner' diff --git a/configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py b/configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..c89518141c148161b2dbf082aa7b0a2eb0843539 --- /dev/null +++ b/configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py @@ -0,0 +1,64 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +# dataset settings +train_dataloader = dict(batch_size=2048, drop_last=True) +val_dataloader = dict(drop_last=False) +test_dataloader = dict(drop_last=False) + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='large', + img_size=224, + patch_size=16, + frozen_stages=24, + out_type='cls_token', + final_norm=True, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=dict(type='ClsBatchNormNeck', input_features=1024), + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=1024, + loss=dict(type='CrossEntropyLoss'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)])) + +# optimizer +optim_wrapper = dict( + _delete_=True, + type='AmpOptimWrapper', + optimizer=dict(type='LARS', lr=6.4, weight_decay=0.0, momentum=0.9)) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=80, + by_epoch=True, + begin=10, + end=90, + 
eta_min=0.0, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=90) + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3), + logger=dict(type='LoggerHook', interval=10)) + +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..76c0df22b7bc5ac52dd50ebdaf4b141efa20352f --- /dev/null +++ b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py @@ -0,0 +1,56 @@ +_base_ = [ + '../_base_/models/mae_hivit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'norm': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=1560, + by_epoch=True, + begin=40, + end=1600, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..8107fccb5c5c18df90cda43cccf21cb7b86f5245 --- /dev/null +++ b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py @@ -0,0 +1,56 @@ +_base_ = [ + '../_base_/models/mae_hivit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'norm': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=360, + by_epoch=True, + begin=40, + end=400, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..c150e0412b2092ec7a137bd3e488cea00ef2fc7f --- /dev/null +++ b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py @@ -0,0 +1,56 @@ +_base_ = [ + '../_base_/models/mae_hivit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'norm': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=760, + by_epoch=True, + begin=40, + end=800, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..5d5e40db5478755f751f4dd1c989d0c5906ca1d7 --- /dev/null +++ b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py @@ -0,0 +1,61 @@ +_base_ = [ + '../_base_/models/mae_hivit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(type='MAEHiViT', arch='large'), + neck=dict(type='MAEPretrainDecoder', embed_dim=768)) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'norm': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=1560, + by_epoch=True, + begin=40, + end=1600, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2c6c47d08fdfa676dd30f628fa06c60595434f85 --- /dev/null +++ b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py @@ -0,0 +1,61 @@ +_base_ = [ + '../_base_/models/mae_hivit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(type='MAEHiViT', arch='large'), + neck=dict(type='MAEPretrainDecoder', embed_dim=768)) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'norm': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=360, + by_epoch=True, + begin=40, + end=400, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4ed7d207a135264f9a1c20863fbf80d493f6f678 --- /dev/null +++ b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py @@ -0,0 +1,61 @@ +_base_ = [ + '../_base_/models/mae_hivit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(type='MAEHiViT', arch='large'), + neck=dict(type='MAEPretrainDecoder', embed_dim=768)) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'norm': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=760, + by_epoch=True, + begin=40, + end=800, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True +find_unused_parameters = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..bbad841818f0a96ab233b96820446c7b0d72de4a --- /dev/null +++ b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py @@ -0,0 +1,56 @@ +_base_ = [ + '../_base_/models/mae_vit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + 'cls_token': dict(decay_mult=0.) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=1560, + by_epoch=True, + begin=40, + end=1600, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f11fb2fa98c55034a7fa3397ea337044e43f3358 --- /dev/null +++ b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py @@ -0,0 +1,56 @@ +_base_ = [ + '../_base_/models/mae_vit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + 'cls_token': dict(decay_mult=0.) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=260, + by_epoch=True, + begin=40, + end=300, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d8f0398356cc8c1302d9739d73b88bec0bab3b92 --- /dev/null +++ b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py @@ -0,0 +1,56 @@ +_base_ = [ + '../_base_/models/mae_vit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + 'cls_token': dict(decay_mult=0.) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=360, + by_epoch=True, + begin=40, + end=400, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..01e0fb423969642174ac38d19a57e0db5c6cfc61 --- /dev/null +++ b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py @@ -0,0 +1,56 @@ +_base_ = [ + '../_base_/models/mae_vit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + 'cls_token': dict(decay_mult=0.) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.000000001, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=760, + by_epoch=True, + begin=40, + end=800, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..5eb7a427eb0a7cfcf2da5cbc85aa1ca89d82d152 --- /dev/null +++ b/configs/mae/mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py @@ -0,0 +1,66 @@ +_base_ = [ + '../_base_/models/mae_vit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(type='MAEViT', arch='h', patch_size=14), + neck=dict( + type='MAEPretrainDecoder', + embed_dim=1280, + patch_size=14, + num_patches=256), + head=dict(patch_size=14)) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + 'cls_token': dict(decay_mult=0.) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=1560, + by_epoch=True, + begin=40, + end=1600, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..683790c0c9a80c532e0865627f48e313b3fc6595 --- /dev/null +++ b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py @@ -0,0 +1,61 @@ +_base_ = [ + '../_base_/models/mae_vit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(type='MAEViT', arch='l'), + neck=dict(type='MAEPretrainDecoder', embed_dim=1024)) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + 'cls_token': dict(decay_mult=0.) 
+ })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=1560, + by_epoch=True, + begin=40, + end=1600, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-300e_in1k.py b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..539207466d25617946b2dde38612587da2b6f30e --- /dev/null +++ b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-300e_in1k.py @@ -0,0 +1,61 @@ +_base_ = [ + '../_base_/models/mae_vit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(type='MAEViT', arch='l'), + neck=dict(type='MAEPretrainDecoder', embed_dim=1024)) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + 'cls_token': dict(decay_mult=0.) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=260, + by_epoch=True, + begin=40, + end=300, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f050522a2209fea0feaa2a594e10900fca47f006 --- /dev/null +++ b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py @@ -0,0 +1,61 @@ +_base_ = [ + '../_base_/models/mae_vit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(type='MAEViT', arch='l'), + neck=dict(type='MAEPretrainDecoder', embed_dim=1024)) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + 'cls_token': dict(decay_mult=0.) 
+ })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=360, + by_epoch=True, + begin=40, + end=400, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..5a4294db3275a405357c08b09c07f5672faa4adc --- /dev/null +++ b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py @@ -0,0 +1,61 @@ +_base_ = [ + '../_base_/models/mae_vit-base-p16.py', + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(type='MAEViT', arch='l'), + neck=dict(type='MAEPretrainDecoder', embed_dim=1024)) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + 'cls_token': dict(decay_mult=0.) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.000000001, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=760, + by_epoch=True, + begin=40, + end=800, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mae/metafile.yml b/configs/mae/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..8192672305de26ee20d00e1a59ad3180322491ed --- /dev/null +++ b/configs/mae/metafile.yml @@ -0,0 +1,367 @@ +Collections: + - Name: MAE + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - AdamW + Training Resources: 8x A100-80G GPUs + Architecture: + - ViT + Paper: + Title: Masked Autoencoders Are Scalable Vision Learners + URL: https://arxiv.org/abs/2111.06377 + README: configs/mae/README.md + +Models: + - Name: mae_vit-base-p16_8xb512-amp-coslr-300e_in1k + Metadata: + Epochs: 300 + Batch Size: 4096 + FLOPs: 17581972224 + Parameters: 111907840 + Training Data: ImageNet-1k + In Collection: MAE + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.pth + Config: configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py + Downstream: + - vit-base-p16_mae-300e-pre_8xb2048-linear-coslr-90e_in1k + - vit-base-p16_mae-300e-pre_8xb128-coslr-100e_in1k + - Name: mae_vit-base-p16_8xb512-amp-coslr-400e_in1k + Metadata: + Epochs: 400 + Batch Size: 4096 + FLOPs: 17581972224 + Parameters: 111907840 + Training Data: ImageNet-1k + In Collection: MAE + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.pth + Config: configs/mae/mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py + Downstream: + - vit-base-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k + - vit-base-p16_mae-400e-pre_8xb128-coslr-100e_in1k + - Name: mae_vit-base-p16_8xb512-amp-coslr-800e_in1k + Metadata: + Epochs: 800 + Batch Size: 4096 + FLOPs: 17581972224 + Parameters: 111907840 + Training Data: ImageNet-1k + In Collection: MAE + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth + Config: configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py + Downstream: + - vit-base-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k + - vit-base-p16_mae-800e-pre_8xb128-coslr-100e_in1k + - Name: mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k + Metadata: + Epochs: 1600 + Batch Size: 4096 + FLOPs: 17581972224 + Parameters: 111907840 + Training Data: ImageNet-1k + In Collection: MAE + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth + Config: configs/mae/mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py + Downstream: + - vit-base-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k + - vit-base-p16_mae-1600e-pre_8xb128-coslr-100e_in1k + - Name: mae_vit-large-p16_8xb512-amp-coslr-400e_in1k + Metadata: + Epochs: 400 + Batch Size: 4096 + FLOPs: 61603111936 + Parameters: 329541888 + Training Data: ImageNet-1k + In Collection: MAE + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.pth + Config: configs/mae/mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py + Downstream: + - vit-large-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k + - vit-large-p16_mae-400e-pre_8xb128-coslr-50e_in1k + - Name: mae_vit-large-p16_8xb512-amp-coslr-800e_in1k 
+ Metadata: + Epochs: 800 + Batch Size: 4096 + FLOPs: 61603111936 + Parameters: 329541888 + Training Data: ImageNet-1k + In Collection: MAE + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.pth + Config: configs/mae/mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py + Downstream: + - vit-large-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k + - vit-large-p16_mae-800e-pre_8xb128-coslr-50e_in1k + - Name: mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k + Metadata: + Epochs: 1600 + Batch Size: 4096 + FLOPs: 61603111936 + Parameters: 329541888 + Training Data: ImageNet-1k + In Collection: MAE + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.pth + Config: configs/mae/mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py + Downstream: + - vit-large-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k + - vit-large-p16_mae-1600e-pre_8xb128-coslr-50e_in1k + - Name: mae_vit-huge-p16_8xb512-amp-coslr-1600e_in1k + Metadata: + Epochs: 1600 + Batch Size: 4096 + FLOPs: 167400741120 + Parameters: 657074508 + Training Data: ImageNet-1k + In Collection: MAE + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.pth + Config: configs/mae/mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py + Downstream: + - vit-huge-p14_mae-1600e-pre_8xb128-coslr-50e_in1k + - vit-huge-p14_mae-1600e-pre_32xb8-coslr-50e_in1k-448px + - Name: vit-base-p16_mae-300e-pre_8xb128-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 1024 + FLOPs: 17581215744 + Parameters: 86566120 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.1 + Weights: null + Config: configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py + - Name: vit-base-p16_mae-400e-pre_8xb128-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 1024 + FLOPs: 17581215744 + Parameters: 86566120 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.3 + Weights: null + Config: configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py + - Name: vit-base-p16_mae-800e-pre_8xb128-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 1024 + FLOPs: 17581215744 + Parameters: 86566120 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.3 + Weights: null + Config: configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py + - Name: vit-base-p16_mae-1600e-pre_8xb128-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 1024 + FLOPs: 17581215744 + Parameters: 86566120 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.5 + Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.pth + Config: configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py + - Name: vit-base-p16_mae-300e-pre_8xb2048-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 16384 + FLOPs: 
17581972992 + Parameters: 86567656 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 60.8 + Weights: null + Config: configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py + - Name: vit-base-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 16384 + FLOPs: 17581972992 + Parameters: 86567656 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 62.5 + Weights: null + Config: configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py + - Name: vit-base-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 16384 + FLOPs: 17581972992 + Parameters: 86567656 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 65.1 + Weights: null + Config: configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py + - Name: vit-base-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 16384 + FLOPs: 17581972992 + Parameters: 86567656 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 67.1 + Weights: null + Config: configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py + - Name: vit-large-p16_mae-400e-pre_8xb128-coslr-50e_in1k + Metadata: + Epochs: 50 + Batch Size: 1024 + FLOPs: 61602103296 + Parameters: 304324584 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.2 + Weights: null + Config: configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py + - Name: vit-large-p16_mae-800e-pre_8xb128-coslr-50e_in1k + Metadata: + Epochs: 50 + Batch Size: 1024 + FLOPs: 61602103296 + Parameters: 304324584 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.4 + Weights: null + Config: configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py + - Name: vit-large-p16_mae-1600e-pre_8xb128-coslr-50e_in1k + Metadata: + Epochs: 50 + Batch Size: 1024 + FLOPs: 61602103296 + Parameters: 304324584 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.7 + Weights: null + Config: configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py + - Name: vit-large-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 16384 + FLOPs: 61603112960 + Parameters: 304326632 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 70.7 + Weights: null + Config: configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py + - Name: vit-large-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 16384 + FLOPs: 61603112960 + Parameters: 304326632 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 73.7 + Weights: null + Config: configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py + - Name: vit-large-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 16384 + 
FLOPs: 61603112960 + Parameters: 304326632 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 75.5 + Weights: null + Config: configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py + - Name: vit-huge-p14_mae-1600e-pre_8xb128-coslr-50e_in1k + Metadata: + Epochs: 50 + Batch Size: 1024 + FLOPs: 167399096320 + Parameters: 632043240 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.9 + Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k_20220916-0bfc9bfd.pth + Config: configs/mae/benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py + - Name: vit-huge-p14_mae-1600e-pre_32xb8-coslr-50e_in1k-448px + Metadata: + Epochs: 50 + Batch Size: 256 + FLOPs: 732131983360 + Parameters: 633026280 + Training Data: ImageNet-1k + In Collection: MAE + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 87.3 + Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448_20220916-95b6a0ce.pth + Config: configs/mae/benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py diff --git a/configs/maskfeat/README.md b/configs/maskfeat/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d25b32bb2d45990d91185de0cb34ee7e5dd9ecc5 --- /dev/null +++ b/configs/maskfeat/README.md @@ -0,0 +1,85 @@ +# MaskFeat + +> [Masked Feature Prediction for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2112.09133v1) + + + +## Abstract + +We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet. + +
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :------------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------------: | :--------------------------------------------------------------------: | +| `maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k` | 85.88 | 17.58 | [config](maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k` | [MASKFEAT](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth) | 86.57 | 17.58 | 83.40 | [config](benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.json) | + +## Citation + +```bibtex +@InProceedings{wei2022masked, + author = {Wei, Chen and Fan, Haoqi and Xie, Saining and Wu, Chao-Yuan and Yuille, Alan and Feichtenhofer, Christoph}, + title = {Masked Feature Prediction for Self-Supervised Visual Pre-Training}, + booktitle = {CVPR}, + year = {2022}, +} +``` diff --git a/configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py 
b/configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..5a7620b46b4337adbff8aa97834d347c5da09e55 --- /dev/null +++ b/configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py @@ -0,0 +1,114 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +# dataset +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=0.3333333333333333, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs'), +] +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(batch_size=256, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(batch_size=256, dataset=dict(pipeline=test_pipeline)) + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + drop_path_rate=0.1, + out_type='avg_featmap', + final_norm=False, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=2e-5, bias=2e-5) + ]), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +# optimizer wrapper +optim_wrapper = dict( + optimizer=dict( + type='AdamW', lr=8e-3, weight_decay=0.05, betas=(0.9, 0.999)), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.65, + custom_keys={ + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=20, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=80, + by_epoch=True, + begin=20, + end=100, + eta_min=1e-6, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100) +default_hooks = dict( + # save checkpoint per epoch. 
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0) diff --git a/configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..465ff5c36465080be4ad50e6b1511b728c3318f1 --- /dev/null +++ b/configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py @@ -0,0 +1,103 @@ +_base_ = '../_base_/default_runtime.py' + +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='SelfSupDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + to_rgb=True) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.5, 1.0), + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='BEiTMaskGenerator', + input_size=14, + num_masking_patches=78, + min_num_patches=15, + ), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=256, + num_workers=8, + persistent_workers=True, + pin_memory=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='meta/train.txt', + data_prefix=dict(img_path='train/'), + pipeline=train_pipeline)) + +# model settings +model = dict( + type='MaskFeat', + backbone=dict(type='MaskFeatViT', arch='b', patch_size=16), + neck=dict( + type='LinearNeck', + in_channels=768, + out_channels=108, + norm_cfg=None, + init_cfg=dict(type='TruncNormal', layer='Linear', std=0.02, bias=0)), + head=dict( + type='MIMHead', + loss=dict(type='PixelReconstructionLoss', criterion='L2')), + target_generator=dict( + type='HOGGenerator', nbins=9, pool=8, gaussian_window=16)) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict( + type='AdamW', lr=2e-4 * 8, betas=(0.9, 0.999), weight_decay=0.05), + clip_grad=dict(max_norm=0.02), + paramwise_cfg=dict( + bias_decay_mult=0.0, + norm_decay_mult=0.0, + flat_decay_mult=0.0, + custom_keys={ + # 'pos_embed': dict(decay_mult=0.), + # 'cls_token': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-6, + by_epoch=True, + begin=0, + end=30, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=270, + eta_min=1e-6, + by_epoch=True, + begin=30, + end=300, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/maskfeat/metafile.yml b/configs/maskfeat/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..1e1e1b4ae263077d2f88bc40aa893a57e3bba14a --- /dev/null +++ b/configs/maskfeat/metafile.yml @@ -0,0 +1,43 @@ +Collections: + - Name: MaskFeat + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - AdamW + Training Resources: 8x A100-80G GPUs + Architecture: + - ViT + Paper: + Title: Masked Feature Prediction for Self-Supervised Visual Pre-Training + URL: https://arxiv.org/abs/2112.09133v1 + README: configs/maskfeat/README.md + +Models: + - Name: maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k + Metadata: + Epochs: 300 + Batch Size: 2048 + FLOPs: 17581972224 + Parameters: 85882692 + Training Data: ImageNet-1k + In Collection: MaskFeat + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth + Config: configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py + Downstream: + - vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k + - Name: vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 2048 + FLOPs: 17581215744 + Parameters: 86566120 + Training Data: ImageNet-1k + In Collection: MaskFeat + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.4 + Weights: https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.pth + Config: configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py diff --git a/configs/mff/README.md b/configs/mff/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7001c74be203f5997275372e57e9de4952a8f9f3 --- /dev/null +++ b/configs/mff/README.md @@ -0,0 +1,60 @@ +# MFF + +> [Improving Pixel-based MIM by Reducing Wasted Modeling Capability](https://arxiv.org/abs/2308.00261) + + + +## Abstract + +There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation. + +
+ +
+ +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py None +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :-------------------------------------------- | :--------: | :-------: | :------------------------------------------------------: | :------------------------------------------------------------------------------: | +| `mff_vit-base-p16_8xb512-amp-coslr-300e_in1k` | - | - | [config](mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.json) | +| `mff_vit-base-p16_8xb512-amp-coslr-800e_in1k` | - | - | [config](mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k` | [MFF 300-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth) | 86.57 | 17.58 | 83.00 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.pth) / [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.json) | +| `vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k` | [MFF 800-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth) | 86.57 | 17.58 | 83.70 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb128-coslr-100e/vit-base-p16_8xb128-coslr-100e_20230802-6780e47d.pth) / [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb128-coslr-100e/vit-base-p16_8xb128-coslr-100e_20230802-6780e47d.json) | +| `vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k` | [MFF 
300-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth) | 304.33 | 61.60 | 64.20 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb2048-linear-coslr-90e_in1k/vit-base-p16_8xb2048-linear-coslr-90e_in1k.json) | +| `vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k` | [MFF 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth) | 304.33 | 61.60 | 68.30 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb2048-linear-coslr-90e/vit-base-p16_8xb2048-linear-coslr-90e_20230802-6b1f7bc8.pth) / [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb2048-linear-coslr-90e/vit-base-p16_8xb2048-linear-coslr-90e_20230802-6b1f7bc8.json) | + +## Citation + +```bibtex +@article{MFF, + title={Improving Pixel-based MIM by Reducing Wasted Modeling Capability}, + author={Yuan Liu, Songyang Zhang, Jiacheng Chen, Zhaohui Yu, Kai Chen, Dahua Lin}, + journal={arXiv}, + year={2023} +} +``` diff --git a/configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py b/configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4cf9ca1134766cd3b0179b7581511cd94dedbbc2 --- /dev/null +++ b/configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py @@ -0,0 +1,114 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=0.3333333333333333, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs') +] +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + drop_path_rate=0.1, + out_type='avg_featmap', + final_norm=False, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=[dict(type='TruncNormal', layer='Linear', 
std=2e-5)]), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +# optimizer wrapper +optim_wrapper = dict( + optimizer=dict( + type='AdamW', lr=2e-3, weight_decay=0.05, betas=(0.9, 0.999)), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.65, + custom_keys={ + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=95, + by_epoch=True, + begin=5, + end=100, + eta_min=1e-6, + convert_to_iter_based=True) +] + +# runtime settings +default_hooks = dict( + # save checkpoint per epoch. + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +train_cfg = dict(by_epoch=True, max_epochs=100) + +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py b/configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..dc5f23077a20dad906fb44cf074322b394ea021d --- /dev/null +++ b/configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py @@ -0,0 +1,74 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ToPIL', to_rgb=True), + dict(type='MAERandomResizedCrop', size=224, interpolation=3), + dict(type='torchvision/RandomHorizontalFlip', p=0.5), + dict(type='ToNumpy', to_bgr=True), + dict(type='PackInputs'), +] + +# dataset settings +train_dataloader = dict( + batch_size=2048, drop_last=True, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(drop_last=False) +test_dataloader = dict(drop_last=False) + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + frozen_stages=12, + out_type='cls_token', + final_norm=True, + init_cfg=dict(type='Pretrained', prefix='backbone.')), + neck=dict(type='ClsBatchNormNeck', input_features=768), + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)])) + +# optimizer +optim_wrapper = dict( + _delete_=True, + type='AmpOptimWrapper', + optimizer=dict(type='LARS', lr=6.4, weight_decay=0.0, momentum=0.9)) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=80, + by_epoch=True, + begin=10, + end=90, + eta_min=0.0, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=90) + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=1), + logger=dict(type='LoggerHook', interval=10)) + +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/mff/metafile.yml b/configs/mff/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..f1da4cc4823e7a4b80bb150987ceccd40e91bedd --- /dev/null +++ b/configs/mff/metafile.yml @@ -0,0 +1,103 @@ 
+Collections: + - Name: MFF + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - AdamW + Training Resources: 8x A100-80G GPUs + Architecture: + - ViT + Paper: + Title: Improving Pixel-based MIM by Reducing Wasted Modeling Capability + URL: https://arxiv.org/pdf/2308.00261.pdf + README: configs/mff/README.md + +Models: + - Name: mff_vit-base-p16_8xb512-amp-coslr-300e_in1k + Metadata: + Epochs: 300 + Batch Size: 2048 + FLOPs: 17581972224 + Parameters: 85882692 + Training Data: ImageNet-1k + In Collection: MFF + Results: null + Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth + Config: configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py + Downstream: + - vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k + - vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k + - Name: mff_vit-base-p16_8xb512-amp-coslr-800e_in1k + Metadata: + Epochs: 800 + Batch Size: 2048 + FLOPs: 17581972224 + Parameters: 85882692 + Training Data: ImageNet-1k + In Collection: MFF + Results: null + Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth + Config: configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py + Downstream: + - vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k + - vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k + - Name: vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 1024 + FLOPs: 17581215744 + Parameters: 86566120 + Training Data: ImageNet-1k + In Collection: MFF + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.0 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.pth + Config: configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py + - Name: vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 1024 + FLOPs: 17581215744 + Parameters: 86566120 + Training Data: ImageNet-1k + In Collection: MFF + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.7 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb128-coslr-100e/vit-base-p16_8xb128-coslr-100e_20230802-6780e47d.pth + Config: configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py + - Name: vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 16384 + FLOPs: 17581215744 + Parameters: 86566120 + Training Data: ImageNet-1k + In Collection: MFF + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 64.2 + Weights: + Config: configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py + - Name: vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 16384 + FLOPs: 17581215744 + Parameters: 86566120 + Training Data: ImageNet-1k + In Collection: MFF + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 68.3 + Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb2048-linear-coslr-90e/vit-base-p16_8xb2048-linear-coslr-90e_20230802-6b1f7bc8.pth + Config: 
configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py diff --git a/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py b/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f9fc5219e4d8d7384bfc0e24bc98c67a71964962 --- /dev/null +++ b/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py @@ -0,0 +1,24 @@ +_base_ = '../mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py' + +randomness = dict(seed=2, diff_rank_seed=True) + +# dataset config +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ToPIL', to_rgb=True), + dict(type='torchvision/Resize', size=224), + dict( + type='torchvision/RandomCrop', + size=224, + padding=4, + padding_mode='reflect'), + dict(type='torchvision/RandomHorizontalFlip', p=0.5), + dict(type='ToNumpy', to_bgr=True), + dict(type='PackInputs') +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) + +# model config +model = dict( + type='MFF', backbone=dict(type='MFFViT', out_indices=[0, 2, 4, 6, 8, 11])) diff --git a/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d8976b22dd94d4d5d0906542c495fc23833d8e02 --- /dev/null +++ b/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py @@ -0,0 +1,24 @@ +_base_ = '../mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py' + +randomness = dict(seed=2, diff_rank_seed=True) + +# dataset config +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ToPIL', to_rgb=True), + dict(type='torchvision/Resize', size=224), + dict( + type='torchvision/RandomCrop', + size=224, + padding=4, + padding_mode='reflect'), + dict(type='torchvision/RandomHorizontalFlip', p=0.5), + dict(type='ToNumpy', to_bgr=True), + dict(type='PackInputs') +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) + +# model config +model = dict( + type='MFF', backbone=dict(type='MFFViT', out_indices=[0, 2, 4, 6, 8, 11])) diff --git a/configs/milan/README.md b/configs/milan/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e1fe2289c56d27bd2fb9c6655dce769e92b155c7 --- /dev/null +++ b/configs/milan/README.md @@ -0,0 +1,104 @@ +# MILAN + +> [MILAN: Masked Image Pretraining on Language Assisted Representation](https://arxiv.org/pdf/2208.06049) + + + +## Abstract + +Self-attention based transformer models have been dominating many computer +vision tasks in the past few years. Their superb model qualities heavily depend +on the excessively large labeled image datasets. In order to reduce the reliance +on large labeled datasets, reconstruction based masked autoencoders are gaining +popularity, which learn high quality transferable representations from unlabeled +images. For the same purpose, recent weakly supervised image pretraining methods +explore language supervision from text captions accompanying the images. In this +work, we propose masked image pretraining on language assisted representation, +dubbed as MILAN. Instead of predicting raw pixels or low level features, our +pretraining objective is to reconstruct the image features with substantial semantic +signals that are obtained using caption supervision. Moreover, to accommodate our +reconstruction target, we propose a more efficient prompting decoder architecture +and a semantic aware mask sampling mechanism, which further advance the +transfer performance of the pretrained model. 
Experimental results demonstrate +that MILAN delivers higher accuracy than the previous works. When the masked +autoencoder is pretrained and finetuned on ImageNet-1K dataset with an input +resolution of 224×224, MILAN achieves a top-1 accuracy of 85.4% on ViT-B/16, surpassing previous state-of-the-arts by 1%. In the downstream semantic +segmentation task, MILAN achieves 52.7 mIoU using ViT-B/16 backbone on +ADE20K dataset, outperforming previous masked pretraining results by 4 points. + +
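+
+The key idea above is that the decoder does not regress raw pixels but patch features produced by a frozen CLIP image encoder. The snippet below is a minimal, illustrative PyTorch sketch of such a feature-reconstruction objective; the helper `milan_style_feature_loss` and the tensor shapes are assumptions for illustration only, while the pretraining config in this folder uses `MIMHead` with a `CosineSimilarityLoss`:
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def milan_style_feature_loss(pred, target):
+    """Toy cosine-similarity reconstruction loss on patch features.
+
+    pred:   (B, N, C) features predicted by the decoder for masked patches.
+    target: (B, N, C) features of the same patches from a frozen CLIP image
+            encoder, i.e. the "language assisted" reconstruction target.
+    """
+    pred = F.normalize(pred, dim=-1)
+    target = F.normalize(target, dim=-1)
+    # 1 - cosine similarity, averaged over patches and batch.
+    return (1 - (pred * target).sum(dim=-1)).mean()
+
+
+# Example with random tensors; real targets come from the CLIP generator.
+loss = milan_style_feature_loss(torch.randn(2, 49, 512), torch.randn(2, 49, 512))
+```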
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('vit-base-p16_milan-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('milan_vit-base-p16_16xb256-amp-coslr-400e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :----------------------------------------------- | :--------: | :-------: | :---------------------------------------------------------: | :------------------------------------------------------------------------: | +| `milan_vit-base-p16_16xb256-amp-coslr-400e_in1k` | 111.91 | 17.58 | [config](milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `vit-base-p16_milan-pre_8xb128-coslr-100e_in1k` | [MILAN](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) | 86.57 | 17.58 | 85.30 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.json) | +| `vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k` | [MILAN](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) | 86.57 | 17.58 | 78.90 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py) | 
[model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.json) | + +## Citation + +```bibtex +@article{Hou2022MILANMI, + title={MILAN: Masked Image Pretraining on Language Assisted Representation}, + author={Zejiang Hou and Fei Sun and Yen-Kuang Chen and Yuan Xie and S. Y. Kung}, + journal={ArXiv}, + year={2022} +} +``` diff --git a/configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py b/configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e8a3f4983ac19208090ee63e9c9160b945b22ee6 --- /dev/null +++ b/configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py @@ -0,0 +1,114 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=0.3333333333333333, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs') +] +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + drop_path_rate=0.1, + out_type='avg_featmap', + final_norm=False, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +# optimizer wrapper +optim_wrapper = dict( + optimizer=dict( + type='AdamW', lr=4e-4, weight_decay=0.05, betas=(0.9, 0.999)), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.65, + custom_keys={ + '.ln': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=95, + 
by_epoch=True, + begin=5, + end=100, + eta_min=1e-6, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=100) +default_hooks = dict( + # save checkpoint per epoch. + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py b/configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0b7333ca475ad1d9607ddda898acb623e1bd7aa4 --- /dev/null +++ b/configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py @@ -0,0 +1,70 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../../_base_/default_runtime.py' +] + +train_dataloader = dict(batch_size=2048, drop_last=True) +val_dataloader = dict(drop_last=False) +test_dataloader = dict(drop_last=False) + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + frozen_stages=12, + out_type='cls_token', + final_norm=True, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=dict(type='ClsBatchNormNeck', input_features=768), + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]), + data_preprocessor=dict( + num_classes=1000, + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + to_rgb=True, + )) + +# optimizer +optim_wrapper = dict( + _delete_=True, + type='AmpOptimWrapper', + optimizer=dict(type='LARS', lr=3.2, weight_decay=0.0, momentum=0.9), +) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=90, + by_epoch=True, + begin=10, + end=100, + eta_min=0.0, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=100) + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3), + logger=dict(type='LoggerHook', interval=10)) + +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/milan/metafile.yml b/configs/milan/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..a790815fa28d063f909dfc1855b2a33f67f59893 --- /dev/null +++ b/configs/milan/metafile.yml @@ -0,0 +1,59 @@ +Collections: + - Name: MILAN + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - AdamW + Training Resources: 16x A100-80G GPUs + Architecture: + - ViT + Paper: + Title: 'MILAN: Masked Image Pretraining on Language Assisted Representation' + URL: https://arxiv.org/pdf/2208.06049 + README: configs/milan/README.md + +Models: + - Name: milan_vit-base-p16_16xb256-amp-coslr-400e_in1k + Metadata: + Epochs: 400 + Batch Size: 4096 + FLOPs: 17581972224 + Parameters: 111907584 + Training Data: ImageNet-1k + In Collection: MILAN + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth + Config: configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py + Downstream: + - vit-base-p16_milan-pre_8xb128-coslr-100e_in1k + - 
vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k + - Name: vit-base-p16_milan-pre_8xb128-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 1024 + FLOPs: 17581215744 + Parameters: 86566120 + Training Data: ImageNet-1k + In Collection: MILAN + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.3 + Weights: https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth + Config: configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py + - Name: vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 16384 + FLOPs: 17581972992 + Parameters: 86567656 + Training Data: ImageNet-1k + In Collection: MILAN + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.9 + Weights: https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.pth + Config: configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py diff --git a/configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py b/configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ac80ab7b1bff159eed3eacc432a1b7b48e4cb221 --- /dev/null +++ b/configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py @@ -0,0 +1,88 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_dataloader = dict(batch_size=256) + +# model settings +model = dict( + type='MILAN', + backbone=dict( + type='MILANViT', + arch='b', + patch_size=16, + mask_ratio=0.75, + init_cfg=[ + dict(type='Xavier', distribution='uniform', layer='Linear'), + dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0) + ]), + neck=dict( + type='MILANPretrainDecoder', + init_cfg=[ + dict(type='Xavier', distribution='uniform', layer='Linear'), + dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0) + ]), + head=dict( + type='MIMHead', + loss=dict( + type='CosineSimilarityLoss', shift_factor=2.0, scale_factor=2.0), + ), + target_generator=dict( + type='CLIPGenerator', + tokenizer_path= # noqa + 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa + ), + init_cfg=None) + +# optimizer wrapper +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * 4096 / 256, + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict( + custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'pos_embed': dict(decay_mult=0.), + 'mask_token': dict(decay_mult=0.), + 'cls_token': dict(decay_mult=0.) 
+ })) +find_unused_parameters = True + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=360, + by_epoch=True, + begin=40, + end=400, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# auto resume +resume = True + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/minigpt4/README.md b/configs/minigpt4/README.md new file mode 100644 index 0000000000000000000000000000000000000000..23666fc9f951262bf9aee65dda933c0000b891f8 --- /dev/null +++ b/configs/minigpt4/README.md @@ -0,0 +1,53 @@ +# MiniGPT4 + +> [MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models](https://arxiv.org/abs/2304.10592) + + + +## Abstract + +The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4 like detailed image description generation and website creation from hand-written drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, teaching users how to cook based on food photos, etc. In our experiment, we found that only performing the pretraining on raw image-text pairs could produce unnatural language outputs that lack coherency including repetition and fragmented sentences. To address this problem, we curate a high-quality, well-aligned dataset in the second stage to finetune our model using a conversational template. This step proved crucial for augmenting the model's generation reliability and overall usability. Notably, our model is highly computationally efficient, as we only train a projection layer utilizing approximately 5 million aligned image-text pairs. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/. + +
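+
+A minimal, illustrative PyTorch sketch of the single trainable projection layer described above; the class name `VisualToLLMProjector` and the dimensions (32 query tokens of width 768 mapped to a 4096-dim LLM embedding space) are assumptions for illustration, not the MMPreTrain API:
+
+```python
+import torch
+import torch.nn as nn
+
+
+class VisualToLLMProjector(nn.Module):
+    """Toy stand-in for MiniGPT-4's single trainable projection layer.
+
+    It maps the Q-Former query tokens into the frozen LLM's embedding space
+    so the image can be fed to the language model as a soft prompt.
+    """
+
+    def __init__(self, qformer_dim=768, llm_dim=4096):
+        super().__init__()
+        self.proj = nn.Linear(qformer_dim, llm_dim)
+
+    def forward(self, query_tokens):
+        # query_tokens: (B, num_query_tokens, qformer_dim), e.g. (B, 32, 768)
+        return self.proj(query_tokens)
+
+
+# Both the vision encoder and the LLM stay frozen; only this layer is trained.
+image_prompt = VisualToLLMProjector()(torch.randn(1, 32, 768))
+print(image_prompt.shape)  # torch.Size([1, 32, 4096])
+```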
+ +
+ +## How to use it? + + + +**Use the model** + +```python +from mmpretrain import inference_model + +result = inference_model('minigpt-4_vicuna-7b_caption', 'demo/cat-dog.png') +print(result) +# {'pred_caption': 'This image shows a small dog and a kitten sitting on a blanket in a field of flowers. The dog is looking up at the kitten with a playful expression on its face. The background is a colorful striped blanket, and there are flowers all around them. The image is well composed with the two animals sitting in the center of the frame, surrounded by the flowers and blanket.'} +``` + + + +## Models and results + +For Vicuna model, please refer to [MiniGPT-4 page](https://github.com/Vision-CAIR/MiniGPT-4) for preparation guidelines. + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :------------------------------ | :--------: | :-------: | :----------------------------------------: | :----------------------------------------------------------------------------------------------------------: | +| `minigpt-4_baichuan-7b_caption` | 8094.77 | N/A | [config](minigpt-4_baichuan-7b_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_baichuan7b_20231011-5dca7ed6.pth) | +| `minigpt-4_vicuna-7b_caption`\* | 8121.32 | N/A | [config](minigpt-4_vicuna-7b_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_vicuna7b_20230615-714b5f52.pth) | + +*Models with * are converted from the [official repo](https://github.com/Vision-CAIR/MiniGPT-4/tree/main). The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{zhu2023minigpt, + title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models}, + author={Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed}, + journal={arXiv preprint arXiv:2304.10592}, + year={2023} +} +``` diff --git a/configs/minigpt4/metafile.yml b/configs/minigpt4/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..f70cc9ba6045f237414f8dc3ee8572187528a667 --- /dev/null +++ b/configs/minigpt4/metafile.yml @@ -0,0 +1,37 @@ +Collections: + - Name: MiniGPT4 + Metadata: + Architecture: + - Transformer + - Gated Cross-Attention Dense + Paper: + Title: 'MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models' + URL: https://arxiv.org/abs/2304.10592 + README: configs/minigpt4/README.md + +Models: + - Name: minigpt-4_vicuna-7b_caption + Metadata: + FLOPs: null + Parameters: 8121315072 + In Collection: MiniGPT4 + Results: + - Task: Image Caption + Dataset: COCO + Metrics: null + Weights: https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_vicuna7b_20230615-714b5f52.pth + Config: configs/minigpt4/minigpt-4_vicuna-7b_caption.py + Converted From: + Weights: https://github.com/Vision-CAIR/MiniGPT-4/tree/main + Code: https://github.com/Vision-CAIR/MiniGPT-4/tree/main + - Name: minigpt-4_baichuan-7b_caption + Metadata: + FLOPs: null + Parameters: 8094769024 + In Collection: MiniGPT4 + Results: + - Task: Image Caption + Dataset: COCO + Metrics: null + Weights: https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_baichuan7b_20231011-5dca7ed6.pth + Config: configs/minigpt4/minigpt-4_baichuan-7b_caption.py diff --git a/configs/minigpt4/minigpt-4_baichuan-7b_caption.py b/configs/minigpt4/minigpt-4_baichuan-7b_caption.py new file mode 
100644 index 0000000000000000000000000000000000000000..7e610a099c8dfcea86dff87c69487f6879926f21 --- /dev/null +++ b/configs/minigpt4/minigpt-4_baichuan-7b_caption.py @@ -0,0 +1,190 @@ +_base_ = [ + '../_base_/default_runtime.py', +] + +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(224, 224), + interpolation='bicubic', + backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='CleanCaption', + keys='chat_content', + remove_chars='', + lowercase=False), + dict( + type='PackInputs', + algorithm_keys=['chat_content', 'lang'], + meta_keys=['image_id']), +] + +train_dataloader = dict( + batch_size=2, + num_workers=4, + dataset=dict( + type='MiniGPT4Dataset', + data_root='YOUR_DATA_DIRECTORY', + ann_file='YOUR_DATA_FILE', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + drop_last=False, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(224, 224), + interpolation='bicubic', + backend='pillow'), + dict(type='PackInputs', meta_keys=['image_id']), +] + +test_evaluator = dict( + type='COCOCaption', + ann_file='data/coco/annotations/coco_karpathy_val_gt.json', +) + +test_dataloader = dict( + batch_size=1, + dataset=dict( + type='COCOCaption', + data_root='data/coco', + ann_file='annotations/coco_karpathy_val.json', + pipeline=test_pipeline)) + +# model settings +model = dict( + type='MiniGPT4', + vision_encoder=dict( + type='BEiTViT', + # eva-g without the final layer + arch=dict( + embed_dims=1408, + num_layers=39, + num_heads=16, + feedforward_channels=6144, + ), + img_size=224, + patch_size=14, + layer_scale_init_value=0.0, + frozen_stages=39, + use_abs_pos_emb=True, + use_rel_pos_bias=False, + final_norm=False, + use_shared_rel_pos_bias=False, + out_type='raw', + pretrained= # noqa + 'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_eva-g-p14_20230615-e908c021.pth' # noqa + ), + q_former_model=dict( + type='Qformer', + model_style='bert-base-uncased', + vision_model_width=1408, + add_cross_attention=True, + cross_attention_freq=2, + num_query_token=32, + pretrained= # noqa + 'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_qformer_20230615-1dfa889c.pth' # noqa + ), + lang_encoder=dict( + type='AutoModelForCausalLM', + name_or_path='baichuan-inc/baichuan-7B', + trust_remote_code=True), + tokenizer=dict( + type='AutoTokenizer', + name_or_path='baichuan-inc/baichuan-7B', + trust_remote_code=True), + task='caption', + prompt_template=dict([('en', '###Ask: {} ###Answer: '), + ('zh', '###问:{} ###答:')]), + raw_prompts=dict([ + ('en', [(' ' + 'Describe this image in detail.'), + (' ' + 'Take a look at this image and describe what you notice.'), + (' ' + 'Please provide a detailed description of the picture.'), + (' ' + 'Could you describe the contents of this image for me?')]), + ('zh', [(' ' + '详细描述这张图片。'), (' ' + '浏览这张图片并描述你注意到什么。'), + (' ' + '请对这张图片进行详细的描述。'), + (' ' + '你能为我描述这张图片的内容吗?')]) + ]), + max_txt_len=160, + end_sym='###') + +strategy = dict( + type='DeepSpeedStrategy', + fp16=dict( + enabled=True, + auto_cast=False, + fp16_master_weights_and_grads=False, + loss_scale=0, + loss_scale_window=1000, + hysteresis=1, + min_loss_scale=1, + initial_scale_power=16, + ), + 
inputs_to_half=[0], + zero_optimization=dict( + stage=2, + allgather_partitions=True, + allgather_bucket_size=2e8, + reduce_scatter=True, + reduce_bucket_size='auto', + overlap_comm=True, + contiguous_gradients=True, + ), +) + +# schedule settings +optim_wrapper = dict( + type='DeepSpeedOptimWrapper', + optimizer=dict(type='AdamW', lr=1e-3, weight_decay=0.05)) + +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-3 / 500, + by_epoch=False, + begin=0, + end=500, + ), + dict( + type='CosineAnnealingLR', + eta_min=2e-4, + by_epoch=False, + begin=500, + ), +] + +train_cfg = dict(by_epoch=True, max_epochs=6) +test_cfg = dict() + +runner_type = 'FlexibleRunner' + +default_hooks = dict( + checkpoint=dict( + type='CheckpointHook', + interval=1, + by_epoch=True, + save_last=True, + max_keep_ckpts=1, + )) diff --git a/configs/minigpt4/minigpt-4_vicuna-7b_caption.py b/configs/minigpt4/minigpt-4_vicuna-7b_caption.py new file mode 100644 index 0000000000000000000000000000000000000000..f468e2d8fac7ce46871801c9cc490acb97db683d --- /dev/null +++ b/configs/minigpt4/minigpt-4_vicuna-7b_caption.py @@ -0,0 +1,94 @@ +_base_ = [ + '../_base_/datasets/coco_caption.py', + '../_base_/default_runtime.py', +] + +# dataset settings +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(224, 224), + interpolation='bicubic', + backend='pillow'), + dict(type='PackInputs', meta_keys=['image_id']), +] + +val_dataloader = dict(batch_size=1, dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# model settings +model = dict( + type='MiniGPT4', + vision_encoder=dict( + type='BEiTViT', + # eva-g without the final layer + arch=dict( + embed_dims=1408, + num_layers=39, + num_heads=16, + feedforward_channels=6144, + ), + img_size=224, + patch_size=14, + layer_scale_init_value=0.0, + frozen_stages=39, + use_abs_pos_emb=True, + use_rel_pos_bias=False, + final_norm=False, + use_shared_rel_pos_bias=False, + out_type='raw', + pretrained= # noqa + 'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_eva-g-p14_20230615-e908c021.pth' # noqa + ), + q_former_model=dict( + type='Qformer', + model_style='bert-base-uncased', + vision_model_width=1408, + add_cross_attention=True, + cross_attention_freq=2, + num_query_token=32, + pretrained= # noqa + 'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_qformer_20230615-1dfa889c.pth' # noqa + ), + lang_encoder=dict( + type='AutoModelForCausalLM', name_or_path='YOUR_PATH_TO_VICUNA'), + tokenizer=dict(type='LlamaTokenizer', name_or_path='YOUR_PATH_TO_VICUNA'), + task='caption', + prompt_template=dict([('en', '###Ask: {} ###Answer: '), + ('zh', '###问:{} ###答:')]), + raw_prompts=dict([ + ('en', [(' ' + 'Describe this image in detail.'), + (' ' + 'Take a look at this image and describe what you notice.'), + (' ' + 'Please provide a detailed description of the picture.'), + (' ' + 'Could you describe the contents of this image for me?')]), + ('zh', [(' ' + '详细描述这张图片。'), (' ' + '浏览这张图片并描述你注意到什么。'), + (' ' + '请对这张图片进行详细的描述。'), + (' ' + '你能为我描述这张图片的内容吗?')]) + ]), + max_txt_len=160, + end_sym='###') + +# schedule settings +optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05)) + +param_scheduler = [ + dict( + type='CosineAnnealingLR', + by_epoch=True, + begin=0, + end=5, + ) +] + +train_cfg = dict(by_epoch=True, max_epochs=5) +val_cfg = dict() +test_cfg = dict() diff --git a/configs/mixmim/README.md b/configs/mixmim/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..e07f5011b32463a7be65d2cbe285148e88a6b3fc --- /dev/null +++ b/configs/mixmim/README.md @@ -0,0 +1,102 @@ +# MixMIM + +> [MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning](https://arxiv.org/abs/2205.13137) + + + +## Abstract + +In this study, we propose Mixed and Masked Image Modeling (MixMIM), a +simple but efficient MIM method that is applicable to various hierarchical Vision +Transformers. Existing MIM methods replace a random subset of input tokens with +a special [MASK] symbol and aim at reconstructing original image tokens from +the corrupted image. However, we find that using the [MASK] symbol greatly +slows down the training and causes training-finetuning inconsistency, due to the +large masking ratio (e.g., 40% in BEiT). In contrast, we replace the masked tokens +of one image with visible tokens of another image, i.e., creating a mixed image. +We then conduct dual reconstruction to reconstruct the original two images from +the mixed input, which significantly improves efficiency. While MixMIM can +be applied to various architectures, this paper explores a simpler but stronger +hierarchical Transformer, and scales with MixMIM-B, -L, and -H. Empirical +results demonstrate that MixMIM can learn high-quality visual representations +efficiently. Notably, MixMIM-B with 88M parameters achieves 85.1% top-1 +accuracy on ImageNet-1K by pretraining for 600 epochs, setting a new record for +neural networks with comparable model sizes (e.g., ViT-B) among MIM methods. +Besides, its transferring performances on the other 6 datasets show MixMIM has +better FLOPs / performance tradeoff than previous MIM methods + +
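+
+A minimal, illustrative sketch of the token-mixing step described above; the helper `mix_tokens`, the shapes, and the 50% masking ratio are assumptions for illustration, while the actual logic lives in the `MixMIMPretrainTransformer` backbone used by the pretraining config:
+
+```python
+import torch
+
+
+def mix_tokens(tokens_a, tokens_b, mask):
+    """Replace the masked tokens of image A with the visible tokens of image B.
+
+    tokens_a, tokens_b: (B, N, C) patch tokens from two different images.
+    mask: (B, N) boolean tensor, True where a token of image A is masked.
+    The result is a mixed image; the model is then trained to reconstruct
+    both original images from this single mixed input (dual reconstruction).
+    """
+    return torch.where(mask.unsqueeze(-1), tokens_b, tokens_a)
+
+
+B, N, C = 2, 49, 1024
+mask = torch.rand(B, N) < 0.5  # ~50% masking, matching mask_ratio=0.5 below
+mixed = mix_tokens(torch.randn(B, N, C), torch.randn(B, N, C), mask)
+```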
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('mixmim_mixmim-base_16xb128-coslr-300e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------: | :--------------------------------------------------------------------------------: | +| `mixmim_mixmim-base_16xb128-coslr-300e_in1k` | 114.67 | 16.35 | [config](mixmim_mixmim-base_16xb128-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k` | [MIXMIM](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.pth) | 88.34 | 16.35 | 84.63 | [config](benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.json) | + +## Citation + +```bibtex +@article{MixMIM2022, + author = {Jihao Liu, Xin Huang, Yu Liu, Hongsheng Li}, + journal = {arXiv:2205.13137}, + title = {MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning}, + year = {2022}, +} +``` diff --git a/configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py b/configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..c48ee3b8b64e96490e4e9ceaaab5b2b5b1f3f3cc --- 
/dev/null +++ b/configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py @@ -0,0 +1,133 @@ +_base_ = [ + '../../_base_/models/mixmim/mixmim_base.py', + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/default_runtime.py' +] + +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' + +data_preprocessor = dict( + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +train_dataloader = dict( + batch_size=128, + num_workers=16, + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='meta/train.txt', + data_prefix='train', + pipeline=train_pipeline), + sampler=dict(type='DefaultSampler', shuffle=True), + persistent_workers=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +val_dataloader = dict( + batch_size=64, + num_workers=8, + pin_memory=True, + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='meta/val.txt', + data_prefix='val', + pipeline=test_pipeline), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +test_dataloader = val_dataloader + +model = dict( + backbone=dict( + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.'))) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict( + type='AdamW', + lr=5e-4 * (8 * 128 / 256), + betas=(0.9, 0.999), + weight_decay=0.05), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.7, + custom_keys={ + '.ln': dict(decay_mult=0.0), # do not decay on ln and bias + '.bias': dict(decay_mult=0.0) + })) + +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-6, + by_epoch=True, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=95, + eta_min=1e-6, + by_epoch=True, + begin=5, + end=100, + convert_to_iter_based=True) +] + +train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=10) +val_cfg = dict() +test_cfg = dict() + +default_hooks = dict( + # save checkpoint per epoch. 
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=1)) diff --git a/configs/mixmim/benchmarks/mixmim-base_8xb64_in1k.py b/configs/mixmim/benchmarks/mixmim-base_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..86ada85f4ef1e7934e44b4f044ff9d9adf88f782 --- /dev/null +++ b/configs/mixmim/benchmarks/mixmim-base_8xb64_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../../_base_/models/mixmim/mixmim_base.py', + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/schedules/imagenet_bs256.py', + '../../_base_/default_runtime.py' +] diff --git a/configs/mixmim/metafile.yml b/configs/mixmim/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..5bf87bda937f5091629c89143fd997cad0deb132 --- /dev/null +++ b/configs/mixmim/metafile.yml @@ -0,0 +1,51 @@ +Collections: + - Name: MixMIM + Metadata: + Architecture: + - Attention Dropout + - Convolution + - Dense Connections + - Dropout + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + - Tanh Activation + Paper: + Title: 'MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation + Learning' + URL: https://arxiv.org/abs/2205.13137 + README: configs/mixmim/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/mixmim.py + Version: v1.0.0rc4 + +Models: + - Name: mixmim_mixmim-base_16xb128-coslr-300e_in1k + Metadata: + Epochs: 300 + Batch Size: 2048 + FLOPs: 16351906816 + Parameters: 114665784 + Training Data: ImageNet-1k + In Collection: MixMIM + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.pth + Config: configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py + Downstream: + - mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k + - Name: mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 1024 + FLOPs: 16351906816 + Parameters: 88344352 + Training Data: ImageNet-1k + In Collection: MixMIM + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.63 + Weights: https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.pth + Config: configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py diff --git a/configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py b/configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..29b94eaea311767a7fe91c47753680e5af6d0181 --- /dev/null +++ b/configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py @@ -0,0 +1,98 @@ +_base_ = '../_base_/default_runtime.py' + +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='SelfSupDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + to_rgb=True) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.2, 1.0), + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5), + dict(type='PackInputs') +] + +train_dataloader = dict( + batch_size=128, + num_workers=8, + persistent_workers=True, + pin_memory=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + 
dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='meta/train.txt', + data_prefix=dict(img_path='train/'), + pipeline=train_pipeline)) + +# model settings +model = dict( + type='MixMIM', + backbone=dict( + type='MixMIMPretrainTransformer', + arch='B', + drop_rate=0.0, + drop_path_rate=0.0, # drop_path_rate=0.0 during pretraining + mask_ratio=0.5), + neck=dict( + type='MixMIMPretrainDecoder', + num_patches=49, + encoder_stride=32, + embed_dim=1024, + decoder_embed_dim=512, + decoder_depth=8, + decoder_num_heads=16), + head=dict( + type='MixMIMPretrainHead', + norm_pix=True, + loss=dict(type='PixelReconstructionLoss', criterion='L2'))) + +# optimizer wrapper +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict( + type='AdamW', + lr=1.5e-4 * (2048 / 256), + betas=(0.9, 0.95), + weight_decay=0.05), + paramwise_cfg=dict(custom_keys={ + 'ln': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0) + })) + +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=260, + by_epoch=True, + begin=40, + end=300, + convert_to_iter_based=True) +] + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300) +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=1)) + +randomness = dict(seed=0, diff_rank_seed=True) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/mlp_mixer/README.md b/configs/mlp_mixer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f0bb4ce0984627f9dafe2f86910348cc20a8a0a7 --- /dev/null +++ b/configs/mlp_mixer/README.md @@ -0,0 +1,78 @@ +# MLP-Mixer + +> [MLP-Mixer: An all-MLP Architecture for Vision](https://arxiv.org/abs/2105.01601) + + + +## Abstract + +Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers. + +
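+
+A minimal, illustrative PyTorch sketch of one Mixer block with the two MLP types described above; the class `ToyMixerBlock` and all dimensions are assumptions for illustration, not the MMPreTrain backbone implementation:
+
+```python
+import torch
+import torch.nn as nn
+
+
+class ToyMixerBlock(nn.Module):
+    """One MLP-Mixer block: a token-mixing MLP followed by a channel-mixing MLP."""
+
+    def __init__(self, num_patches=196, dim=768, token_hidden=384, channel_hidden=3072):
+        super().__init__()
+        self.norm1 = nn.LayerNorm(dim)
+        # Mixes information across patches (spatial locations).
+        self.token_mlp = nn.Sequential(
+            nn.Linear(num_patches, token_hidden), nn.GELU(),
+            nn.Linear(token_hidden, num_patches))
+        self.norm2 = nn.LayerNorm(dim)
+        # Mixes information across channels, independently for each patch.
+        self.channel_mlp = nn.Sequential(
+            nn.Linear(dim, channel_hidden), nn.GELU(),
+            nn.Linear(channel_hidden, dim))
+
+    def forward(self, x):  # x: (B, num_patches, dim)
+        y = self.norm1(x).transpose(1, 2)          # (B, dim, num_patches)
+        x = x + self.token_mlp(y).transpose(1, 2)  # token mixing
+        x = x + self.channel_mlp(self.norm2(x))    # channel mixing
+        return x
+
+
+out = ToyMixerBlock()(torch.rand(1, 196, 768))
+```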
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('mlp-mixer-base-p16_3rdparty_64xb64_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('mlp-mixer-base-p16_3rdparty_64xb64_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-base-p16_3rdparty_64xb64_in1k_20211124-1377e3e0.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------: | :-------------------------------------------------------------: | +| `mlp-mixer-base-p16_3rdparty_64xb64_in1k`\* | From scratch | 59.88 | 12.61 | 76.68 | 92.25 | [config](mlp-mixer-base-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-base-p16_3rdparty_64xb64_in1k_20211124-1377e3e0.pth) | +| `mlp-mixer-large-p16_3rdparty_64xb64_in1k`\* | From scratch | 208.20 | 44.57 | 72.34 | 88.02 | [config](mlp-mixer-large-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-large-p16_3rdparty_64xb64_in1k_20211124-5a2519d2.pth) | + +*Models with * are converted from the [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@misc{tolstikhin2021mlpmixer, + title={MLP-Mixer: An all-MLP Architecture for Vision}, + author={Ilya Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Andreas Steiner and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy}, + year={2021}, + eprint={2105.01601}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/configs/mlp_mixer/metafile.yml b/configs/mlp_mixer/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..8b632db100373b10ad7653ed9e0302fa37013ee4 --- /dev/null +++ b/configs/mlp_mixer/metafile.yml @@ -0,0 +1,50 @@ +Collections: + - Name: MLP-Mixer + Metadata: + Training Data: ImageNet-1k + Architecture: + - MLP + - Layer Normalization + - Dropout + Paper: + URL: https://arxiv.org/abs/2105.01601 + Title: "MLP-Mixer: An all-MLP Architecture for Vision" + README: configs/mlp_mixer/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.18.0/mmcls/models/backbones/mlp_mixer.py + Version: v0.18.0 + +Models: + - Name: mlp-mixer-base-p16_3rdparty_64xb64_in1k + In Collection: MLP-Mixer + Config: configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py + Metadata: + FLOPs: 12610000000 # 12.61 G + Parameters: 59880000 # 59.88 M + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 76.68 + Top 5 Accuracy: 92.25 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-base-p16_3rdparty_64xb64_in1k_20211124-1377e3e0.pth + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_mixer_b16_224-76587d61.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py#L70 + + - Name: mlp-mixer-large-p16_3rdparty_64xb64_in1k + In Collection: MLP-Mixer + Config: configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py + Metadata: + FLOPs: 44570000000 # 44.57 G + Parameters: 208200000 # 208.2 M + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 72.34 + Top 5 Accuracy: 88.02 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-large-p16_3rdparty_64xb64_in1k_20211124-5a2519d2.pth + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_mixer_b16_224_in21k-617b3de2.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py#L73 diff --git a/configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py b/configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..bbf4268d3c6121be57d48e8577f3edebde05114b --- /dev/null +++ b/configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py @@ -0,0 +1,8 @@ +_base_ = [ + '../_base_/models/mlp_mixer_base_patch16.py', + '../_base_/datasets/imagenet_bs64_mixer_224.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py', +] + +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py b/configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4fbe9c5c9ebc70ee1b718e904af1bc49fb6d3c78 --- /dev/null +++ b/configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py @@ -0,0 +1,8 @@ +_base_ = [ + '../_base_/models/mlp_mixer_large_patch16.py', + 
'../_base_/datasets/imagenet_bs64_mixer_224.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py', +] + +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/mobilenet_v2/README.md b/configs/mobilenet_v2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..74548e19698ead42fd7cfb86f8a7c04fbee7f022 --- /dev/null +++ b/configs/mobilenet_v2/README.md @@ -0,0 +1,97 @@ +# MobileNet V2 + +> [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) + + + +## Introduction + +**MobileNet V2** is initially described in [the paper](https://arxiv.org/pdf/1801.04381.pdf), which improves the state-of-the-art performance of mobile models on multiple tasks. MobileNetV2 is an improvement on V1: its key new ideas are the linear bottleneck and the inverted residual, a structure in which the input and output of the residual block are thin bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. The authors of MobileNet V2 measure its performance on ImageNet classification, COCO object detection, and VOC image segmentation. + +
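For readers who want a concrete picture of the inverted residual block described above, the following is a minimal PyTorch sketch of the idea (1x1 expansion, 3x3 depthwise filtering, then a linear 1x1 projection back to a thin bottleneck). It is illustrative only and is not the backbone implementation shipped in this repository; the channel numbers and the expansion ratio are assumptions.

```python
import torch
from torch import nn


class InvertedResidual(nn.Module):
    """Minimal sketch of a MobileNetV2-style inverted residual block."""

    def __init__(self, in_channels, out_channels, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_channels * expand_ratio
        self.use_shortcut = stride == 1 and in_channels == out_channels
        self.block = nn.Sequential(
            # 1x1 expansion to a wider representation
            nn.Conv2d(in_channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution (groups == channels)
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # linear 1x1 projection back to a thin bottleneck (no activation)
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out


x = torch.rand(1, 32, 56, 56)
print(InvertedResidual(32, 32)(x).shape)  # torch.Size([1, 32, 56, 56])
```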
+ +
+ +## Abstract + +
+ +Show the paper's abstract + +
+In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3. + +The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers, opposite to traditional residual models which use expanded representations in the input. MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on Imagenet classification, COCO object detection, and VOC image segmentation. We evaluate the trade-offs between accuracy and number of operations measured by multiply-adds (MAdd), as well as the number of parameters. +
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('mobilenet-v2_8xb32_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('mobilenet-v2_8xb32_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: | +| `mobilenet-v2_8xb32_in1k` | From scratch | 3.50 | 0.32 | 71.86 | 90.42 | [config](mobilenet-v2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.json) | + +## Citation + +```bibtex +@INPROCEEDINGS{8578572, + author={M. {Sandler} and A. {Howard} and M. {Zhu} and A. {Zhmoginov} and L. 
{Chen}}, + booktitle={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + title={MobileNetV2: Inverted Residuals and Linear Bottlenecks}, + year={2018}, + volume={}, + number={}, + pages={4510-4520}, + doi={10.1109/CVPR.2018.00474}} +} +``` diff --git a/configs/mobilenet_v2/metafile.yml b/configs/mobilenet_v2/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..aaa490ae485e87c3965f946f3fe25aa52919830b --- /dev/null +++ b/configs/mobilenet_v2/metafile.yml @@ -0,0 +1,34 @@ +Collections: + - Name: MobileNet V2 + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - SGD with Momentum + - Weight Decay + Training Resources: 8x V100 GPUs + Epochs: 300 + Batch Size: 256 + Architecture: + - MobileNet V2 + Paper: + URL: https://arxiv.org/abs/1801.04381 + Title: "MobileNetV2: Inverted Residuals and Linear Bottlenecks" + README: configs/mobilenet_v2/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/mobilenet_v2.py#L101 + Version: v0.15.0 + +Models: + - Name: mobilenet-v2_8xb32_in1k + Metadata: + FLOPs: 319000000 + Parameters: 3500000 + In Collection: MobileNet V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 71.86 + Top 5 Accuracy: 90.42 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth + Config: configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py diff --git a/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py b/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..afd2d9795af601010833ba239465c3e2c5abdf20 --- /dev/null +++ b/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/mobilenet_v2_1x.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_epochstep.py', + '../_base_/default_runtime.py' +] diff --git a/configs/mobilenet_v3/README.md b/configs/mobilenet_v3/README.md new file mode 100644 index 0000000000000000000000000000000000000000..833de5b25aae9a8af43f5e086e6e2fd212669d03 --- /dev/null +++ b/configs/mobilenet_v3/README.md @@ -0,0 +1,99 @@ +# MobileNet V3 + +> [Searching for MobileNetV3](https://arxiv.org/abs/1905.02244) + + + +## Introduction + +**MobileNet V3** is initially described in [the paper](https://arxiv.org/pdf/1905.02244.pdf). MobileNetV3 parameters are obtained by NAS (network architecture search) search, and some practical results of V1 and V2 are inherited, and the attention mechanism of SE channel is attracted, which can be considered as a masterpiece. The author create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. The author of MobileNet V3 measure its performance on Imagenet classification, COCO object detection, and Cityscapes segmentation. + +
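The SE channel attention mentioned in the introduction can be summarized with a short, illustrative sketch. This is not the MMPreTrain implementation; the reduction ratio and the hard-sigmoid gate are stated here as reasonable assumptions about the rough form used in MobileNetV3.

```python
import torch
from torch import nn


class SqueezeExcite(nn.Module):
    """Illustrative SE channel-attention block, roughly as used in MobileNetV3."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Hardsigmoid(),  # MobileNetV3-style hard-sigmoid gate (assumption)
        )

    def forward(self, x):
        # excitation: per-channel weights in [0, 1], broadcast over H x W
        return x * self.fc(self.pool(x))


x = torch.rand(1, 16, 32, 32)
print(SqueezeExcite(16)(x).shape)  # torch.Size([1, 16, 32, 32])
```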
+ +
+ +## Abstract + +
+ +Show the paper's abstract + +
+We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 15% compared to MobileNetV2. MobileNetV3-Small is 4.6% more accurate while reducing latency by 5% compared to MobileNetV2. MobileNetV3-Large detection is 25% faster at roughly the same accuracy as MobileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 30% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation. +
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('mobilenet-v3-small-050_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('mobilenet-v3-small-050_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-050_3rdparty_in1k_20221114-e0b86be1.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :--------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------------: | :--------------------------------------------------------------: | +| `mobilenet-v3-small-050_3rdparty_in1k`\* | From scratch | 1.59 | 0.02 | 57.91 | 80.19 | [config](mobilenet-v3-small-050_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-050_3rdparty_in1k_20221114-e0b86be1.pth) | +| `mobilenet-v3-small-075_3rdparty_in1k`\* | From scratch | 2.04 | 0.04 | 65.23 | 85.44 | [config](mobilenet-v3-small-075_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-075_3rdparty_in1k_20221114-2011fa76.pth) | +| `mobilenet-v3-small_8xb128_in1k` | From scratch | 2.54 | 0.06 | 66.68 | 86.74 | [config](mobilenet-v3-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small_8xb128_in1k_20221114-bd1bfcde.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small_8xb128_in1k_20221114-bd1bfcde.json) | +| `mobilenet-v3-small_3rdparty_in1k`\* | From scratch | 2.54 | 0.06 | 67.66 | 87.41 | [config](mobilenet-v3-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_small-8427ecf0.pth) | +| `mobilenet-v3-large_8xb128_in1k` | From scratch | 5.48 | 0.23 | 73.49 | 91.31 | [config](mobilenet-v3-large_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-large_8xb128_in1k_20221114-0ed9ed9a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-large_8xb128_in1k_20221114-0ed9ed9a.json) | +| `mobilenet-v3-large_3rdparty_in1k`\* | From scratch | 5.48 | 0.23 | 74.04 | 91.34 | [config](mobilenet-v3-large_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth) | + +*Models with * are converted from the [official repo](https://github.com/pytorch/vision/blob/main/torchvision/models/mobilenetv3.py). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@inproceedings{Howard_2019_ICCV, + author = {Howard, Andrew and Sandler, Mark and Chu, Grace and Chen, Liang-Chieh and Chen, Bo and Tan, Mingxing and Wang, Weijun and Zhu, Yukun and Pang, Ruoming and Vasudevan, Vijay and Le, Quoc V. and Adam, Hartwig}, + title = {Searching for MobileNetV3}, + booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, + month = {October}, + year = {2019} +} +``` diff --git a/configs/mobilenet_v3/metafile.yml b/configs/mobilenet_v3/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..53f1653682fa2af2155b786ee5a8f0be9c98698e --- /dev/null +++ b/configs/mobilenet_v3/metafile.yml @@ -0,0 +1,111 @@ +Collections: + - Name: MobileNet V3 + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - RMSprop with Momentum + - Weight Decay + Training Resources: 8x V100 GPUs + Epochs: 600 + Batch Size: 1024 + Architecture: + - MobileNet V3 + Paper: + URL: https://arxiv.org/abs/1905.02244 + Title: Searching for MobileNetV3 + README: configs/mobilenet_v3/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/mobilenet_v3.py + Version: v0.15.0 + +Models: + - Name: mobilenet-v3-small-050_3rdparty_in1k + Metadata: + FLOPs: 24895000 + Parameters: 1590000 + In Collection: MobileNet V3 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 57.91 + Top 5 Accuracy: 80.19 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-050_3rdparty_in1k_20221114-e0b86be1.pth + Config: configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/mobilenetv3_small_050_lambc-4b7bbe87.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/mobilenetv3.py + - Name: mobilenet-v3-small-075_3rdparty_in1k + Metadata: + FLOPs: 44791000 + Parameters: 2040000 + In Collection: MobileNet V3 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 65.23 + Top 5 Accuracy: 85.44 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-075_3rdparty_in1k_20221114-2011fa76.pth + Config: configs/mobilenet_v3/mobilenet-v3-small-075_8xb128_in1k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/mobilenetv3_small_075_lambc-384766db.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/mobilenetv3.py + - Name: mobilenet-v3-small_8xb128_in1k + Metadata: + FLOPs: 60000000 + Parameters: 2540000 + In Collection: MobileNet V3 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 66.68 + Top 5 Accuracy: 86.74 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small_8xb128_in1k_20221114-bd1bfcde.pth + Config: configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py + - Name: mobilenet-v3-small_3rdparty_in1k + Metadata: + FLOPs: 60000000 + Parameters: 2540000 + In Collection: MobileNet V3 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 67.66 + Top 5 Accuracy: 87.41 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_small-8427ecf0.pth + Config: configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py + 
Converted From: + Weights: https://download.pytorch.org/models/mobilenet_v3_small-047dcff4.pth + Code: https://github.com/pytorch/vision/blob/main/torchvision/models/mobilenetv3.py + - Name: mobilenet-v3-large_8xb128_in1k + Metadata: + FLOPs: 230000000 + Parameters: 5480000 + In Collection: MobileNet V3 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 73.49 + Top 5 Accuracy: 91.31 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-large_8xb128_in1k_20221114-0ed9ed9a.pth + Config: configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py + - Name: mobilenet-v3-large_3rdparty_in1k + Metadata: + FLOPs: 230000000 + Parameters: 5480000 + In Collection: MobileNet V3 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 74.04 + Top 5 Accuracy: 91.34 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth + Config: configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py + Converted From: + Weights: https://download.pytorch.org/models/mobilenet_v3_large-8738ca79.pth + Code: https://github.com/pytorch/vision/blob/main/torchvision/models/mobilenetv3.py diff --git a/configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py b/configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f5c05baf39f1cffdb9610d41b1603119a2edc727 --- /dev/null +++ b/configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py @@ -0,0 +1,28 @@ +# Refers to https://pytorch.org/blog/ml-models-torchvision-v0.9/#classification + +_base_ = [ + '../_base_/models/mobilenet_v3/mobilenet_v3_large_imagenet.py', + '../_base_/datasets/imagenet_bs128_mbv3.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict( + optimizer=dict( + type='RMSprop', + lr=0.064, + alpha=0.9, + momentum=0.9, + eps=0.0316, + weight_decay=1e-5)) + +param_scheduler = dict(type='StepLR', by_epoch=True, step_size=2, gamma=0.973) + +train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (8 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py b/configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..fc145625ca22f44ff48a6f4684589ab6833313e3 --- /dev/null +++ b/configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py @@ -0,0 +1,70 @@ +_base_ = [ + '../_base_/models/mobilenet_v3/mobilenet_v3_small_050_imagenet.py', + '../_base_/datasets/imagenet_bs128_mbv3.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict(backbone=dict(norm_cfg=dict(type='BN', eps=1e-5, momentum=0.1))) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='AutoAugment', + policies='imagenet', + hparams=dict(pad_val=[round(x) for x in [103.53, 116.28, 123.675]])), + dict( + type='RandomErasing', + erase_prob=0.2, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) + +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader + +# schedule settings +optim_wrapper = dict( + optimizer=dict( + type='RMSprop', + lr=0.064, + alpha=0.9, + momentum=0.9, + eps=0.0316, + weight_decay=1e-5)) + +param_scheduler = dict(type='StepLR', by_epoch=True, step_size=2, gamma=0.973) + +train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=10) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (8 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/mobilenet_v3/mobilenet-v3-small-075_8xb128_in1k.py b/configs/mobilenet_v3/mobilenet-v3-small-075_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..464b7cbd60e8b741f9765df091bfdadbfe1712a3 --- /dev/null +++ b/configs/mobilenet_v3/mobilenet-v3-small-075_8xb128_in1k.py @@ -0,0 +1,68 @@ +_base_ = [ + '../_base_/models/mobilenet_v3/mobilenet_v3_small_075_imagenet.py', + '../_base_/datasets/imagenet_bs128_mbv3.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict(backbone=dict(norm_cfg=dict(type='BN', eps=1e-5, momentum=0.1))) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='AutoAugment', + policies='imagenet', + hparams=dict(pad_val=[round(x) for x in [103.53, 116.28, 123.675]])), + dict( + type='RandomErasing', + erase_prob=0.2, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# schedule settings +optim_wrapper = dict( + optimizer=dict( + type='RMSprop', + lr=0.064, + alpha=0.9, + momentum=0.9, + eps=0.0316, + weight_decay=1e-5)) + +param_scheduler = dict(type='StepLR', by_epoch=True, step_size=2, gamma=0.973) + +train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=10) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (8 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py b/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..06b0a328106611ced7ede94c0439f3e39d12f306 --- /dev/null +++ b/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py @@ -0,0 +1,28 @@ +# Refers to https://pytorch.org/blog/ml-models-torchvision-v0.9/#classification + +_base_ = [ + '../_base_/models/mobilenet_v3/mobilenet_v3_small_imagenet.py', + '../_base_/datasets/imagenet_bs128_mbv3.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict( + optimizer=dict( + type='RMSprop', + lr=0.064, + alpha=0.9, + momentum=0.9, + eps=0.0316, + weight_decay=1e-5)) + +param_scheduler = dict(type='StepLR', by_epoch=True, step_size=2, gamma=0.973) + +train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (8 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/mobilenet_v3/mobilenet-v3-small_8xb16_cifar10.py b/configs/mobilenet_v3/mobilenet-v3-small_8xb16_cifar10.py new file mode 100644 index 0000000000000000000000000000000000000000..4cfaa2f629523ad66966d3e70c9ca084644e1f8d --- /dev/null +++ b/configs/mobilenet_v3/mobilenet-v3-small_8xb16_cifar10.py @@ -0,0 +1,15 @@ +_base_ = [ + '../_base_/models/mobilenet_v3/mobilenet_v3_small_cifar.py', + '../_base_/datasets/cifar10_bs16.py', + '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py' +] + +# schedule settings +param_scheduler = dict( + type='MultiStepLR', + by_epoch=True, + milestones=[120, 170], + gamma=0.1, +) + +train_cfg = dict(by_epoch=True, max_epochs=200) diff --git a/configs/mobileone/README.md b/configs/mobileone/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e753aff9089fe30700f6db4313fd337f73f7d47d --- /dev/null +++ b/configs/mobileone/README.md @@ -0,0 +1,98 @@ +# MobileOne + +> [An Improved One millisecond Mobile Backbone](https://arxiv.org/abs/2206.04040) + + + +## Introduction + +MobileOne is proposed by Apple and is based on re-parameterization. On Apple chips, the model reaches roughly 76% top-1 accuracy on ImageNet with an inference latency below 1 ms. Its main improvements over [RepVGG](../repvgg) are the following (see the usage sketch after this list): + +- Re-parameterization using depthwise and pointwise convolutions instead of normal convolutions. +- Removal of the residual structure, which is unfriendly to memory access. + +
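As a rough illustration of how the re-parameterization is consumed at inference time: the `deploy` config variants added later in this diff simply set `deploy=True` on the backbone, and a trained multi-branch checkpoint can be fused into the single-branch form before export. The snippet below is a sketch that assumes the backbone exposes a `switch_to_deploy()` method, as re-parameterizable backbones in this codebase generally do; verify against the actual API before relying on it.

```python
import torch
from mmpretrain import get_model

# Training-time (multi-branch) MobileOne-S0.
model = get_model('mobileone-s0_8xb32_in1k', pretrained=True)
model.eval()

inputs = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    out_multi_branch = model(inputs)

# Fuse the parallel branches into single convolutions for deployment.
# Assumption: the backbone provides a `switch_to_deploy()` method.
model.backbone.switch_to_deploy()
with torch.no_grad():
    out_single_branch = model(inputs)

# After fusion the outputs should match up to numerical precision.
print(torch.allclose(out_multi_branch, out_single_branch, atol=1e-5))
```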
+ +
+ +## Abstract + +
+ +Show the paper's abstract + +
+Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with latency of the network when deployed on a mobile device. Therefore, we perform extensive analysis of different metrics by deploying several mobile-friendly networks on a mobile device. We identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks. To this end, we design an efficient backbone MobileOne, with variants achieving an inference time under 1 ms on an iPhone12 with 75.9% top-1 accuracy on ImageNet. We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile. Our best model obtains similar performance on ImageNet as MobileFormer while being 38x faster. Our model obtains 2.3% better top-1 accuracy on ImageNet than EfficientNet at similar latency. Furthermore, we show that our model generalizes to multiple tasks - image classification, object detection, and semantic segmentation with significant improvements in latency and accuracy as compared to existing efficient architectures when deployed on a mobile device. +
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('mobileone-s0_8xb32_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('mobileone-s0_8xb32_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/mobileone/mobileone-s0_8xb32_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/mobileone/mobileone-s0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: | +| `mobileone-s0_8xb32_in1k` | From scratch | 2.08 | 0.27 | 71.34 | 89.87 | [config](mobileone-s0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.json) | +| `mobileone-s1_8xb32_in1k` | From scratch | 4.76 | 0.82 | 75.72 | 92.54 | [config](mobileone-s1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s1_8xb32_in1k_20221110-ceeef467.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s1_8xb32_in1k_20221110-ceeef467.json) | +| `mobileone-s2_8xb32_in1k` | From scratch | 7.81 | 1.30 | 77.37 | 93.34 | [config](mobileone-s2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s2_8xb32_in1k_20221110-9c7ecb97.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s2_8xb32_in1k_20221110-9c7ecb97.json) | +| `mobileone-s3_8xb32_in1k` | From scratch | 10.08 | 1.89 | 78.06 | 93.83 | [config](mobileone-s3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s3_8xb32_in1k_20221110-c95eb3bf.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s3_8xb32_in1k_20221110-c95eb3bf.json) | +| `mobileone-s4_8xb32_in1k` | From scratch | 14.84 | 2.98 | 79.69 | 94.46 | [config](mobileone-s4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s4_8xb32_in1k_20221110-28d888cb.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s4_8xb32_in1k_20221110-28d888cb.json) | + +## Citation + +```bibtex +@article{mobileone2022, + title={An Improved One millisecond Mobile Backbone}, + author={Vasu, Pavan Kumar Anasosalu and Gabriel, James and Zhu, Jeff and Tuzel, Oncel and Ranjan, Anurag}, + journal={arXiv preprint arXiv:2206.04040}, + year={2022} +} +``` diff --git a/configs/mobileone/deploy/mobileone-s0_deploy_8xb32_in1k.py 
b/configs/mobileone/deploy/mobileone-s0_deploy_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..145f3f4ec90f643a056177a7d7c0b8fc370539cc --- /dev/null +++ b/configs/mobileone/deploy/mobileone-s0_deploy_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = ['../mobileone-s0_8xb32_in1k.py'] + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/mobileone/deploy/mobileone-s1_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s1_deploy_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..8602c31ce6c7c3115e3f45313b687816f0854ddb --- /dev/null +++ b/configs/mobileone/deploy/mobileone-s1_deploy_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = ['../mobileone-s1_8xb32_in1k.py'] + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/mobileone/deploy/mobileone-s2_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s2_deploy_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..97aaddd0740b0a005ecab5b08d3459b0da6c474c --- /dev/null +++ b/configs/mobileone/deploy/mobileone-s2_deploy_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = ['../mobileone-s2_8xb32_in1k.py'] + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/mobileone/deploy/mobileone-s3_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s3_deploy_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0d335a7ba9300f8d6d35a288dab02baf0adabdb2 --- /dev/null +++ b/configs/mobileone/deploy/mobileone-s3_deploy_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = ['../mobileone-s3_8xb32_in1k.py'] + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/mobileone/deploy/mobileone-s4_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s4_deploy_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b82f5a9ac7ecd6c5fc84369083c66d6dae0afd51 --- /dev/null +++ b/configs/mobileone/deploy/mobileone-s4_deploy_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = ['../mobileone-s4_8xb32_in1k.py'] + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/mobileone/metafile.yml b/configs/mobileone/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..70370da0e8d56baf8001ddaff1f78110462ad86a --- /dev/null +++ b/configs/mobileone/metafile.yml @@ -0,0 +1,83 @@ +Collections: + - Name: MobileOne + Metadata: + Training Data: ImageNet-1k + Architecture: + - re-parameterization Convolution + - VGG-style Neural Network + - Depthwise Convolution + - Pointwise Convolution + Paper: + URL: https://arxiv.org/abs/2206.04040 + Title: 'An Improved One millisecond Mobile Backbone' + README: configs/mobileone/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc1/configs/mobileone/metafile.yml + Version: v1.0.0rc1 + +Models: + - Name: mobileone-s0_8xb32_in1k + In Collection: MobileOne + Config: configs/mobileone/mobileone-s0_8xb32_in1k.py + Metadata: + FLOPs: 274136576 # 0.27G + Parameters: 2078504 # 2.08M + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 71.34 + Top 5 Accuracy: 89.87 + Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.pth + - Name: mobileone-s1_8xb32_in1k + In Collection: MobileOne + Config: configs/mobileone/mobileone-s1_8xb32_in1k.py + Metadata: + FLOPs: 823839744 # 8.6G + Parameters: 4764840 # 4.82M + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 75.72 + Top 5 Accuracy: 92.54 + Weights: 
https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s1_8xb32_in1k_20221110-ceeef467.pth + - Name: mobileone-s2_8xb32_in1k + In Collection: MobileOne + Config: configs/mobileone/mobileone-s2_8xb32_in1k.py + Metadata: + FLOPs: 1296478848 + Parameters: 7808168 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 77.37 + Top 5 Accuracy: 93.34 + Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s2_8xb32_in1k_20221110-9c7ecb97.pth + - Name: mobileone-s3_8xb32_in1k + In Collection: MobileOne + Config: configs/mobileone/mobileone-s3_8xb32_in1k.py + Metadata: + FLOPs: 1893842944 + Parameters: 10078312 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 78.06 + Top 5 Accuracy: 93.83 + Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s3_8xb32_in1k_20221110-c95eb3bf.pth + - Name: mobileone-s4_8xb32_in1k + In Collection: MobileOne + Config: configs/mobileone/mobileone-s4_8xb32_in1k.py + Metadata: + FLOPs: 2979222528 + Parameters: 14838352 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 79.69 + Top 5 Accuracy: 94.46 + Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s4_8xb32_in1k_20221110-28d888cb.pth diff --git a/configs/mobileone/mobileone-s0_8xb32_in1k.py b/configs/mobileone/mobileone-s0_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..be56b86c3ce4afc3cc61995efa60830be98050e0 --- /dev/null +++ b/configs/mobileone/mobileone-s0_8xb32_in1k.py @@ -0,0 +1,20 @@ +_base_ = [ + '../_base_/models/mobileone/mobileone_s0.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.)) + +val_dataloader = dict(batch_size=256) +test_dataloader = dict(batch_size=256) + +custom_hooks = [ + dict( + type='EMAHook', + momentum=5e-4, + priority='ABOVE_NORMAL', + update_buffers=True) +] diff --git a/configs/mobileone/mobileone-s1_8xb32_in1k.py b/configs/mobileone/mobileone-s1_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0bc3fb08922e0c87ad681e79c378d2b5404b696f --- /dev/null +++ b/configs/mobileone/mobileone-s1_8xb32_in1k.py @@ -0,0 +1,60 @@ +_base_ = [ + '../_base_/models/mobileone/mobileone_s1.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.)) + +val_dataloader = dict(batch_size=256) +test_dataloader = dict(batch_size=256) + +bgr_mean = _base_.data_preprocessor['mean'][::-1] +base_train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=7, + magnitude_std=0.5, + hparams=dict(pad_val=[round(x) for x in bgr_mean])), + dict(type='PackInputs') +] + +import copy # noqa: E402 + +# modify start epoch's RandomResizedCrop.scale to 160 +train_pipeline_1e = copy.deepcopy(base_train_pipeline) +train_pipeline_1e[1]['scale'] = 160 +train_pipeline_1e[3]['magnitude_level'] *= 0.1 +_base_.train_dataloader.dataset.pipeline = train_pipeline_1e + +# 
modify 37 epoch's RandomResizedCrop.scale to 192 +train_pipeline_37e = copy.deepcopy(base_train_pipeline) +train_pipeline_37e[1]['scale'] = 192 +train_pipeline_1e[3]['magnitude_level'] *= 0.2 + +# modify 112 epoch's RandomResizedCrop.scale to 224 +train_pipeline_112e = copy.deepcopy(base_train_pipeline) +train_pipeline_112e[1]['scale'] = 224 +train_pipeline_1e[3]['magnitude_level'] *= 0.3 + +custom_hooks = [ + dict( + type='SwitchRecipeHook', + schedule=[ + dict(action_epoch=37, pipeline=train_pipeline_37e), + dict(action_epoch=112, pipeline=train_pipeline_112e), + ]), + dict( + type='EMAHook', + momentum=5e-4, + priority='ABOVE_NORMAL', + update_buffers=True) +] diff --git a/configs/mobileone/mobileone-s2_8xb32_in1k.py b/configs/mobileone/mobileone-s2_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a7d4aae074952538d5d037b33438172f4c283613 --- /dev/null +++ b/configs/mobileone/mobileone-s2_8xb32_in1k.py @@ -0,0 +1,65 @@ +_base_ = [ + '../_base_/models/mobileone/mobileone_s2.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.)) + +val_dataloader = dict(batch_size=256) +test_dataloader = dict(batch_size=256) + +import copy # noqa: E402 + +bgr_mean = _base_.data_preprocessor['mean'][::-1] +base_train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=7, + magnitude_std=0.5, + hparams=dict(pad_val=[round(x) for x in bgr_mean])), + dict(type='PackInputs') +] + +# modify start epoch RandomResizedCrop.scale to 160 +# and RA.magnitude_level * 0.3 +train_pipeline_1e = copy.deepcopy(base_train_pipeline) +train_pipeline_1e[1]['scale'] = 160 +train_pipeline_1e[3]['magnitude_level'] *= 0.3 +_base_.train_dataloader.dataset.pipeline = train_pipeline_1e + +import copy # noqa: E402 + +# modify 137 epoch's RandomResizedCrop.scale to 192 +# and RA.magnitude_level * 0.7 +train_pipeline_37e = copy.deepcopy(base_train_pipeline) +train_pipeline_37e[1]['scale'] = 192 +train_pipeline_37e[3]['magnitude_level'] *= 0.7 + +# modify 112 epoch's RandomResizedCrop.scale to 224 +# and RA.magnitude_level * 1.0 +train_pipeline_112e = copy.deepcopy(base_train_pipeline) +train_pipeline_112e[1]['scale'] = 224 +train_pipeline_112e[3]['magnitude_level'] *= 1.0 + +custom_hooks = [ + dict( + type='SwitchRecipeHook', + schedule=[ + dict(action_epoch=37, pipeline=train_pipeline_37e), + dict(action_epoch=112, pipeline=train_pipeline_112e), + ]), + dict( + type='EMAHook', + momentum=5e-4, + priority='ABOVE_NORMAL', + update_buffers=True) +] diff --git a/configs/mobileone/mobileone-s3_8xb32_in1k.py b/configs/mobileone/mobileone-s3_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2be0dc7e814c4e5a28369ae8888221f3e26ec657 --- /dev/null +++ b/configs/mobileone/mobileone-s3_8xb32_in1k.py @@ -0,0 +1,65 @@ +_base_ = [ + '../_base_/models/mobileone/mobileone_s3.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.)) + +val_dataloader = dict(batch_size=256) +test_dataloader = 
dict(batch_size=256) + +import copy # noqa: E402 + +bgr_mean = _base_.data_preprocessor['mean'][::-1] +base_train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=7, + magnitude_std=0.5, + hparams=dict(pad_val=[round(x) for x in bgr_mean])), + dict(type='PackInputs') +] + +# modify start epoch RandomResizedCrop.scale to 160 +# and RA.magnitude_level * 0.3 +train_pipeline_1e = copy.deepcopy(base_train_pipeline) +train_pipeline_1e[1]['scale'] = 160 +train_pipeline_1e[3]['magnitude_level'] *= 0.3 +_base_.train_dataloader.dataset.pipeline = train_pipeline_1e + +import copy # noqa: E402 + +# modify 137 epoch's RandomResizedCrop.scale to 192 +# and RA.magnitude_level * 0.7 +train_pipeline_37e = copy.deepcopy(base_train_pipeline) +train_pipeline_37e[1]['scale'] = 192 +train_pipeline_37e[3]['magnitude_level'] *= 0.7 + +# modify 112 epoch's RandomResizedCrop.scale to 224 +# and RA.magnitude_level * 1.0 +train_pipeline_112e = copy.deepcopy(base_train_pipeline) +train_pipeline_112e[1]['scale'] = 224 +train_pipeline_112e[3]['magnitude_level'] *= 1.0 + +custom_hooks = [ + dict( + type='SwitchRecipeHook', + schedule=[ + dict(action_epoch=37, pipeline=train_pipeline_37e), + dict(action_epoch=112, pipeline=train_pipeline_112e), + ]), + dict( + type='EMAHook', + momentum=5e-4, + priority='ABOVE_NORMAL', + update_buffers=True) +] diff --git a/configs/mobileone/mobileone-s4_8xb32_in1k.py b/configs/mobileone/mobileone-s4_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..49356f05f9574f90192dc32d5b14c3b74a5cd459 --- /dev/null +++ b/configs/mobileone/mobileone-s4_8xb32_in1k.py @@ -0,0 +1,63 @@ +_base_ = [ + '../_base_/models/mobileone/mobileone_s4.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.)) + +val_dataloader = dict(batch_size=256) +test_dataloader = dict(batch_size=256) + +bgr_mean = _base_.data_preprocessor['mean'][::-1] +base_train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=7, + magnitude_std=0.5, + hparams=dict(pad_val=[round(x) for x in bgr_mean])), + dict(type='PackInputs') +] + +import copy # noqa: E402 + +# modify start epoch RandomResizedCrop.scale to 160 +# and RA.magnitude_level * 0.3 +train_pipeline_1e = copy.deepcopy(base_train_pipeline) +train_pipeline_1e[1]['scale'] = 160 +train_pipeline_1e[3]['magnitude_level'] *= 0.3 +_base_.train_dataloader.dataset.pipeline = train_pipeline_1e + +# modify 137 epoch's RandomResizedCrop.scale to 192 +# and RA.magnitude_level * 0.7 +train_pipeline_37e = copy.deepcopy(base_train_pipeline) +train_pipeline_37e[1]['scale'] = 192 +train_pipeline_37e[3]['magnitude_level'] *= 0.7 + +# modify 112 epoch's RandomResizedCrop.scale to 224 +# and RA.magnitude_level * 1.0 +train_pipeline_112e = copy.deepcopy(base_train_pipeline) +train_pipeline_112e[1]['scale'] = 224 +train_pipeline_112e[3]['magnitude_level'] *= 1.0 + +custom_hooks = [ + dict( + type='SwitchRecipeHook', + schedule=[ + 
dict(action_epoch=37, pipeline=train_pipeline_37e), + dict(action_epoch=112, pipeline=train_pipeline_112e), + ]), + dict( + type='EMAHook', + momentum=5e-4, + priority='ABOVE_NORMAL', + update_buffers=True) +] diff --git a/configs/mobilevit/README.md b/configs/mobilevit/README.md new file mode 100644 index 0000000000000000000000000000000000000000..fa0960d123aed6eae6fee1155fd99d0955355280 --- /dev/null +++ b/configs/mobilevit/README.md @@ -0,0 +1,96 @@ +# MobileViT + +> [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) + + + +## Introduction + +**MobileViT** introduces a light-weight network that takes advantage of both ViTs and CNNs: it uses the `InvertedResidual` blocks from [MobileNetV2](../mobilenet_v2/README.md) together with `MobileViTBlock`, which borrows [ViT](../vision_transformer/README.md) transformer blocks, to build a standard 5-stage model structure. + +The MobileViTBlock treats transformers as convolutions to compute a global representation and combines them with ordinary convolution layers for local representation, yielding a block with a global receptive field. This differs from ViT, which adds an extra class token and position embeddings to learn relative relationships. Because it uses no position embeddings, MobileViT can benefit from multi-scale inputs during training. + +The paper also puts forward a multi-scale training strategy that dynamically adjusts the batch size based on the image size (see the sketch below) to improve both training efficiency and final performance. + +MobileViT is also proven effective in downstream tasks such as object detection and segmentation. + +
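The batch-size rule behind that multi-scale training strategy can be sketched in a few lines: smaller images get a proportionally larger batch so the per-iteration pixel budget stays roughly constant. The resolutions and the base batch size below are illustrative assumptions, not the recipe used for the released checkpoints.

```python
# Illustrative multi-scale batch-size rule. `max_h`, `max_w` and
# `batch_at_max` are assumptions for illustration only.
max_h, max_w = 320, 320   # largest training resolution
batch_at_max = 128        # batch size used at the largest resolution


def batch_size_for(h: int, w: int) -> int:
    # Scale the batch size inversely to the image area.
    return max(1, (max_h * max_w * batch_at_max) // (h * w))


for h, w in [(160, 160), (192, 192), (256, 256), (320, 320)]:
    print((h, w), batch_size_for(h, w))
# (160, 160) 512, (192, 192) 355, (256, 256) 200, (320, 320) 128
```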
+ +
+ +## Abstract + +
+ +Show the paper's abstract + +
+ +Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters. +
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('mobilevit-small_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('mobilevit-small_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/mobilevit/mobilevit-small_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-small_3rdparty_in1k_20221018-cb4f741c.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :------------------------------------------------------------------------: | +| `mobilevit-small_3rdparty_in1k`\* | From scratch | 5.58 | 2.03 | 78.25 | 94.09 | [config](mobilevit-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-small_3rdparty_in1k_20221018-cb4f741c.pth) | +| `mobilevit-xsmall_3rdparty_in1k`\* | From scratch | 2.32 | 1.05 | 74.75 | 92.32 | [config](mobilevit-xsmall_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xsmall_3rdparty_in1k_20221018-be39a6e7.pth) | +| `mobilevit-xxsmall_3rdparty_in1k`\* | From scratch | 1.27 | 0.42 | 69.02 | 88.91 | [config](mobilevit-xxsmall_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xxsmall_3rdparty_in1k_20221018-77835605.pth) | + +*Models with * are converted from the [official repo](https://github.com/apple/ml-cvnets). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{mehta2021mobilevit, + title={MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer}, + author={Mehta, Sachin and Rastegari, Mohammad}, + journal={arXiv preprint arXiv:2110.02178}, + year={2021} +} +``` diff --git a/configs/mobilevit/metafile.yml b/configs/mobilevit/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..15fd84ad54cacf0c7c0337b5139ba891d14c22f5 --- /dev/null +++ b/configs/mobilevit/metafile.yml @@ -0,0 +1,60 @@ +Collections: + - Name: MobileViT + Metadata: + Training Data: ImageNet-1k + Architecture: + - MobileViT Block + Paper: + URL: https://arxiv.org/abs/2110.02178 + Title: MobileViT Light-weight, General-purpose, and Mobile-friendly Vision Transformer + README: configs/mobilevit/README.md + +Models: + - Name: mobilevit-small_3rdparty_in1k + Metadata: + FLOPs: 2030000000 + Parameters: 5580000 + In Collection: MobileViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.25 + Top 5 Accuracy: 94.09 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-small_3rdparty_in1k_20221018-cb4f741c.pth + Config: configs/mobilevit/mobilevit-small_8xb128_in1k.py + Converted From: + Weights: https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit_s.pt + Code: https://github.com/apple/ml-cvnets + - Name: mobilevit-xsmall_3rdparty_in1k + Metadata: + FLOPs: 1050000000 + Parameters: 2320000 + In Collection: MobileViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 74.75 + Top 5 Accuracy: 92.32 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xsmall_3rdparty_in1k_20221018-be39a6e7.pth + Config: configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py + Converted From: + Weights: https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit_xs.pt + Code: https://github.com/apple/ml-cvnets + - Name: mobilevit-xxsmall_3rdparty_in1k + Metadata: + FLOPs: 420000000 + Parameters: 1270000 + In Collection: MobileViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 69.02 + Top 5 Accuracy: 88.91 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xxsmall_3rdparty_in1k_20221018-77835605.pth + Config: configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py + Converted From: + Weights: https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit_xxs.pt + Code: https://github.com/apple/ml-cvnets diff --git a/configs/mobilevit/mobilevit-small_8xb128_in1k.py b/configs/mobilevit/mobilevit-small_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..596939631c0520e67d480a37669704556719f2dc --- /dev/null +++ b/configs/mobilevit/mobilevit-small_8xb128_in1k.py @@ -0,0 +1,30 @@ +_base_ = [ + '../_base_/models/mobilevit/mobilevit_s.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/default_runtime.py', + '../_base_/schedules/imagenet_bs256.py', +] + +# no normalize for original implements +data_preprocessor = dict( + # RGB format normalization parameters + mean=[0, 0, 0], + std=[255, 255, 255], + # use bgr directly + to_rgb=False, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=288, edge='short'), + dict(type='CenterCrop', crop_size=256), + dict(type='PackInputs'), +] + +train_dataloader = 
dict(batch_size=128) + +val_dataloader = dict( + batch_size=128, + dataset=dict(pipeline=test_pipeline), +) +test_dataloader = val_dataloader diff --git a/configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py b/configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..557892bcc4911912d7e5d585cb0d27235cf08cd5 --- /dev/null +++ b/configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py @@ -0,0 +1,30 @@ +_base_ = [ + '../_base_/models/mobilevit/mobilevit_xs.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/default_runtime.py', + '../_base_/schedules/imagenet_bs256.py', +] + +# no normalize for original implements +data_preprocessor = dict( + # RGB format normalization parameters + mean=[0, 0, 0], + std=[255, 255, 255], + # use bgr directly + to_rgb=False, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=288, edge='short'), + dict(type='CenterCrop', crop_size=256), + dict(type='PackInputs'), +] + +train_dataloader = dict(batch_size=128) + +val_dataloader = dict( + batch_size=128, + dataset=dict(pipeline=test_pipeline), +) +test_dataloader = val_dataloader diff --git a/configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py b/configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..74aea82f32bd65fd71962c588384e4a1e6ab43ea --- /dev/null +++ b/configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py @@ -0,0 +1,30 @@ +_base_ = [ + '../_base_/models/mobilevit/mobilevit_xxs.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/default_runtime.py', + '../_base_/schedules/imagenet_bs256.py', +] + +# no normalize for original implements +data_preprocessor = dict( + # RGB format normalization parameters + mean=[0, 0, 0], + std=[255, 255, 255], + # use bgr directly + to_rgb=False, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=288, edge='short'), + dict(type='CenterCrop', crop_size=256), + dict(type='PackInputs'), +] + +train_dataloader = dict(batch_size=128) + +val_dataloader = dict( + batch_size=128, + dataset=dict(pipeline=test_pipeline), +) +test_dataloader = val_dataloader diff --git a/configs/mocov2/README.md b/configs/mocov2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..cb0ae4ee7468f3294b28157eafb32cb04b63814d --- /dev/null +++ b/configs/mocov2/README.md @@ -0,0 +1,85 @@ +# MoCoV2 + +> [Improved Baselines with Momentum Contrastive Learning](https://arxiv.org/abs/2003.04297) + + + +## Abstract + +Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR. In this note, we verify the effectiveness of two of SimCLR’s design improvements by implementing them in the MoCo framework. With simple modifications to MoCo—namely, using an MLP projection head and more data augmentation—we establish stronger baselines that outperform SimCLR and do not require large training batches. We hope this will make state-of-the-art unsupervised learning research more accessible. + +
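The MLP projection head that MoCo v2 borrows from SimCLR is a small two-layer MLP on top of the backbone features. The sketch below shows its rough shape; the dimensions follow the `MoCoV2Neck` settings in the config added later in this diff, but the module itself is illustrative rather than the repository implementation.

```python
import torch
from torch import nn


class ProjectionMLP(nn.Module):
    """Rough sketch of the 2-layer MLP projection head used by MoCo v2."""

    def __init__(self, in_channels=2048, hid_channels=2048, out_channels=128):
        super().__init__()
        self.avgpool = nn.AdaptiveAvgPool2d(1)  # pool the ResNet-50 feature map
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, hid_channels),
            nn.ReLU(inplace=True),
            nn.Linear(hid_channels, out_channels),
        )

    def forward(self, x):
        x = self.avgpool(x).flatten(1)
        # Embeddings are typically L2-normalized before the contrastive loss.
        return nn.functional.normalize(self.mlp(x), dim=1)


feat = torch.rand(4, 2048, 7, 7)    # backbone output for 4 images
print(ProjectionMLP()(feat).shape)  # torch.Size([4, 128])
```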
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('mocov2_resnet50_8xb32-coslr-200e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :-------------------------------------- | :--------: | :-------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: | +| `mocov2_resnet50_8xb32-coslr-200e_in1k` | 55.93 | 4.11 | [config](mocov2_resnet50_8xb32-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k` | [MOCOV2](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.pth) | 25.56 | 4.11 | 67.50 | [config](benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.json) | + +## Citation + +```bibtex +@article{chen2020improved, + title={Improved baselines with momentum contrastive learning}, + author={Chen, Xinlei and Fan, Haoqi and Girshick, Ross and He, Kaiming}, + journal={arXiv preprint arXiv:2003.04297}, + year={2020} +} +``` diff --git a/configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py b/configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..37795d9c866c5f9b26b0e016959a01677b8a216e --- /dev/null +++ 
b/configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py @@ -0,0 +1,20 @@ +_base_ = [ + '../../_base_/models/resnet50.py', + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_sgd_steplr_100e.py', + '../../_base_/default_runtime.py', +] + +model = dict( + backbone=dict( + frozen_stages=4, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.'))) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='SGD', lr=30., momentum=0.9, weight_decay=0.)) + +# runtime settings +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) diff --git a/configs/mocov2/metafile.yml b/configs/mocov2/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..4440db45b5a1a6ab8352c589471cbd4b6d6bb786 --- /dev/null +++ b/configs/mocov2/metafile.yml @@ -0,0 +1,45 @@ +Collections: + - Name: MoCoV2 + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - SGD with Momentum + - Weight Decay + Training Resources: 8x V100 GPUs + Architecture: + - ResNet + - MoCo + Paper: + Title: Improved Baselines with Momentum Contrastive Learning + URL: https://arxiv.org/abs/2003.04297 + README: configs/mocov2/README.md + +Models: + - Name: mocov2_resnet50_8xb32-coslr-200e_in1k + Metadata: + Epochs: 200 + Batch Size: 256 + FLOPs: 4109364224 + Parameters: 55933312 + Training Data: ImageNet-1k + In Collection: MoCoV2 + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.pth + Config: configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py + Downstream: + - resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k + - Name: resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 256 + FLOPs: 4109464576 + Parameters: 25557032 + Training Data: ImageNet-1k + In Collection: MoCoV2 + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 67.5 + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.pth + Config: configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py diff --git a/configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py b/configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..8037d075a2e5a8490dc4c3709f274784a6f3f4f0 --- /dev/null +++ b/configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_mocov2.py', + '../_base_/schedules/imagenet_sgd_coslr_200e.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='MoCo', + queue_len=65536, + feat_dim=128, + momentum=0.001, + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='BN'), + zero_init_residual=False), + neck=dict( + type='MoCoV2Neck', + in_channels=2048, + hid_channels=2048, + out_channels=128, + with_avg_pool=True), + head=dict( + type='ContrastiveHead', + loss=dict(type='CrossEntropyLoss'), + temperature=0.2)) + +# only keeps the latest 3 checkpoints +default_hooks = dict(checkpoint=dict(max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=256) diff --git a/configs/mocov3/README.md b/configs/mocov3/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a9477e8a6da037a4e773bcb693b0f449f8e8fda7 --- /dev/null +++ b/configs/mocov3/README.md @@ -0,0 +1,96 @@ +# MoCoV3 + +> [An Empirical Study of Training Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.02057) + + + +## Abstract + +This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failure, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research. + +
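As a rough illustration of the objective behind these configs, and not MMPreTrain's `MoCoV3Head` code, the sketch below computes the queue-free, in-batch contrastive loss that MoCo v3 symmetrizes over the two augmented views. Shapes and the single-device simplification (no cross-GPU gathering of keys) are assumptions.

```python
import torch
import torch.nn.functional as F


def mocov3_ctr_loss(q, k, temperature=0.2):
    """In-batch contrastive loss of MoCo v3 (single-device sketch).

    q: (N, C) predictor outputs of the online branch
    k: (N, C) momentum-branch targets for the other view (treated as constants)
    Matching batch indices are positives; every other sample is a negative.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1).detach()
    logits = q @ k.t() / temperature                    # (N, N) similarity matrix
    labels = torch.arange(q.size(0), dtype=torch.long)  # diagonal entries are positives
    # the 2 * temperature factor mirrors `loss_weight=2 * temperature` in the configs
    return 2 * temperature * F.cross_entropy(logits, labels)


# Symmetrized over the two views v1 and v2:
# loss = mocov3_ctr_loss(q1, k2) + mocov3_ctr_loss(q2, k1)
```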
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('resnet50_mocov3-100e-pre_8xb128-linear-coslr-90e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('mocov3_resnet50_8xb512-amp-coslr-100e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-8f7d937e.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :------------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------------: | :--------------------------------------------------------------------: | +| `mocov3_resnet50_8xb512-amp-coslr-100e_in1k` | 68.01 | 4.11 | [config](mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.json) | +| `mocov3_resnet50_8xb512-amp-coslr-300e_in1k` | 68.01 | 4.11 | [config](mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/mocov3_resnet50_8xb512-amp-coslr-300e_in1k_20220927-1e4f3304.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/mocov3_resnet50_8xb512-amp-coslr-300e_in1k_20220927-1e4f3304.json) | +| `mocov3_resnet50_8xb512-amp-coslr-800e_in1k` | 68.01 | 4.11 | [config](mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20220927-e043f51a.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20220927-e043f51a.json) | +| `mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k` | 84.27 | 4.61 | [config](mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k-224_20220826-08bc52f7.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k-224_20220826-08bc52f7.json) | +| `mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k` | 215.68 | 17.58 | [config](mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py) | 
[model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.json) | +| `mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k` | 652.78 | 61.60 | [config](mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k-224_20220829-9b88a442.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k-224_20220829-9b88a442.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `resnet50_mocov3-100e-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3 100-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth) | 25.56 | 4.11 | 69.60 | [config](benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-8f7d937e.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-8f7d937e.json) | +| `resnet50_mocov3-300e-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3 300-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/mocov3_resnet50_8xb512-amp-coslr-300e_in1k_20220927-1e4f3304.pth) | 25.56 | 4.11 | 72.80 | [config](benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-d21ddac2.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-d21ddac2.json) | +| `resnet50_mocov3-800e-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20220927-e043f51a.pth) | 25.56 | 4.11 | 74.40 | [config](benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-0e97a483.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-0e97a483.json) | +| `vit-small-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k` | 
[MOCOV3](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k-224_20220826-08bc52f7.pth) | 22.05 | 4.61 | 73.60 | [config](benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k_20220826-376674ef.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k_20220826-376674ef.json) | +| `vit-base-p16_mocov3-pre_8xb64-coslr-150e_in1k` | [MOCOV3](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.pth) | 86.57 | 17.58 | 83.00 | [config](benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k_20220826-f1e6c442.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k_20220826-f1e6c442.json) | +| `vit-base-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.pth) | 86.57 | 17.58 | 76.90 | [config](benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k_20220826-83be7758.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k_20220826-83be7758.json) | +| `vit-large-p16_mocov3-pre_8xb64-coslr-100e_in1k` | [MOCOV3](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k-224_20220829-9b88a442.pth) | 304.33 | 61.60 | 83.70 | [config](benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k_20220829-878a2f7f.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k_20220829-878a2f7f.json) | + +## Citation + +```bibtex +@InProceedings{Chen_2021_ICCV, + title = {An Empirical Study of Training Self-Supervised Vision Transformers}, + author = {Chen, Xinlei and Xie, Saining and He, Kaiming}, + booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, + year = {2021} +} +``` diff --git a/configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py b/configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py new file mode 100644 index 
0000000000000000000000000000000000000000..4d0b202b0f643c51e5d931cbf1ee59793aae03cb --- /dev/null +++ b/configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py @@ -0,0 +1,31 @@ +_base_ = [ + '../../_base_/models/resnet50.py', + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_sgd_coslr_100e.py', + '../../_base_/default_runtime.py', +] + +# dataset settings +train_dataloader = dict(batch_size=128) + +model = dict( + backbone=dict( + frozen_stages=4, + norm_eval=True, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.'))) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='SGD', lr=0.4, momentum=0.9, weight_decay=0.)) + +# learning rate scheduler +param_scheduler = [ + dict(type='CosineAnnealingLR', T_max=90, by_epoch=True, begin=0, end=90) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=90) + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) diff --git a/configs/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py b/configs/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..91509fc05d6b6274a4bf5237d27d9e28ee365b9d --- /dev/null +++ b/configs/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py @@ -0,0 +1,45 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/default_runtime.py', +] + +# dataset settings +train_dataloader = dict(batch_size=128) + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='MoCoV3ViT', + arch='base', # embed_dim = 768 + img_size=224, + patch_size=16, + stop_grad_conv1=True, + frozen_stages=12, + norm_eval=True, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + init_cfg=dict(type='Normal', std=0.01, layer='Linear'), + )) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='SGD', lr=12, momentum=0.9, weight_decay=0.)) + +# learning rate scheduler +param_scheduler = [ + dict(type='CosineAnnealingLR', T_max=90, by_epoch=True, begin=0, end=90) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=90) +val_cfg = dict() +test_cfg = dict() + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) diff --git a/configs/mocov3/benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py b/configs/mocov3/benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f3d074f6ed93a4f5b108c441d00b12cb51802a62 --- /dev/null +++ b/configs/mocov3/benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py @@ -0,0 +1,74 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + drop_path_rate=0.1, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', 
layer='LayerNorm', val=1., bias=0.), + ]), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict( + type='AdamW', lr=5e-4, eps=1e-8, betas=(0.9, 0.999), + weight_decay=0.05), + clip_grad=dict(max_norm=5.0), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-3, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=145, + eta_min=1e-5, + by_epoch=True, + begin=5, + end=150, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=150) +val_cfg = dict() +test_cfg = dict() + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +randomness = dict(seed=0) diff --git a/configs/mocov3/benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py b/configs/mocov3/benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..46d7f48299edfa39316eeb137c71d72d3a7955b7 --- /dev/null +++ b/configs/mocov3/benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py @@ -0,0 +1,74 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='large', + img_size=224, + patch_size=16, + drop_path_rate=0.5, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=1024, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ]), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict( + type='AdamW', lr=5e-4, eps=1e-8, betas=(0.9, 0.999), + weight_decay=0.05), + clip_grad=dict(max_norm=5.0), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-3, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=95, + eta_min=1e-5, + by_epoch=True, + begin=5, + end=100, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100) +val_cfg = dict() +test_cfg = dict() + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +randomness = dict(seed=0) diff --git a/configs/mocov3/benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py b/configs/mocov3/benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0c1ffa1972641194beff66d2e4ccfa31e5426fca --- /dev/null +++ 
b/configs/mocov3/benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py @@ -0,0 +1,45 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/default_runtime.py', +] + +# dataset settings +train_dataloader = dict(batch_size=128) + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='MoCoV3ViT', + arch='mocov3-small', # embed_dim = 384 + img_size=224, + patch_size=16, + stop_grad_conv1=True, + frozen_stages=12, + norm_eval=True, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=384, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + init_cfg=dict(type='Normal', std=0.01, layer='Linear'), + )) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='SGD', lr=12, momentum=0.9, weight_decay=0.)) + +# learning rate scheduler +param_scheduler = [ + dict(type='CosineAnnealingLR', T_max=90, by_epoch=True, begin=0, end=90) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=90) +val_cfg = dict() +test_cfg = dict() + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) diff --git a/configs/mocov3/metafile.yml b/configs/mocov3/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..649d9f439e65f18b7b1613a861113425cba480ae --- /dev/null +++ b/configs/mocov3/metafile.yml @@ -0,0 +1,201 @@ +Collections: + - Name: MoCoV3 + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - LARS + Training Resources: 32x V100 GPUs + Architecture: + - ResNet + - ViT + - MoCo + Paper: + Title: An Empirical Study of Training Self-Supervised Vision Transformers + URL: https://arxiv.org/abs/2104.02057 + README: configs/mocov3/README.md + +Models: + - Name: mocov3_resnet50_8xb512-amp-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 4096 + FLOPs: 4109364224 + Parameters: 68012160 + Training Data: ImageNet-1k + In Collection: MoCoV3 + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth + Config: configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py + Downstream: + - resnet50_mocov3-100e-pre_8xb128-linear-coslr-90e_in1k + - Name: mocov3_resnet50_8xb512-amp-coslr-300e_in1k + Metadata: + Epochs: 300 + Batch Size: 4096 + FLOPs: 4109364224 + Parameters: 68012160 + Training Data: ImageNet-1k + In Collection: MoCoV3 + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/mocov3_resnet50_8xb512-amp-coslr-300e_in1k_20220927-1e4f3304.pth + Config: configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py + Downstream: + - resnet50_mocov3-300e-pre_8xb128-linear-coslr-90e_in1k + - Name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k + Metadata: + Epochs: 800 + Batch Size: 4096 + FLOPs: 4109364224 + Parameters: 68012160 + Training Data: ImageNet-1k + In Collection: MoCoV3 + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20220927-e043f51a.pth + Config: configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py + Downstream: + - resnet50_mocov3-800e-pre_8xb128-linear-coslr-90e_in1k + - Name: resnet50_mocov3-100e-pre_8xb128-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 1024 + FLOPs: 4109464576 + 
Parameters: 25557032 + Training Data: ImageNet-1k + In Collection: MoCoV3 + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 69.6 + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-8f7d937e.pth + Config: configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py + - Name: resnet50_mocov3-300e-pre_8xb128-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 1024 + FLOPs: 4109464576 + Parameters: 25557032 + Training Data: ImageNet-1k + In Collection: MoCoV3 + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 72.8 + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-d21ddac2.pth + Config: configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py + - Name: resnet50_mocov3-800e-pre_8xb128-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 1024 + FLOPs: 4109464576 + Parameters: 25557032 + Training Data: ImageNet-1k + In Collection: MoCoV3 + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 74.4 + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-0e97a483.pth + Config: configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py + - Name: mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k + Metadata: + Epochs: 300 + Batch Size: 4096 + FLOPs: 4607954304 + Parameters: 84266752 + Training Data: ImageNet-1k + In Collection: MoCoV3 + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k-224_20220826-08bc52f7.pth + Config: configs/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py + Downstream: + - vit-small-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k + - Name: vit-small-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 1024 + FLOPs: 4607954304 + Parameters: 22050664 + Training Data: ImageNet-1k + In Collection: MoCoV3 + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 73.6 + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k_20220826-376674ef.pth + Config: configs/mocov3/benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py + - Name: mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k + Metadata: + Epochs: 300 + Batch Size: 4096 + FLOPs: 17581972224 + Parameters: 215678464 + Training Data: ImageNet-1k + In Collection: MoCoV3 + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.pth + Config: configs/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py + Downstream: + - vit-base-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k + - vit-base-p16_mocov3-pre_8xb64-coslr-150e_in1k + - Name: vit-base-p16_mocov3-pre_8xb64-coslr-150e_in1k + Metadata: + Epochs: 150 + Batch Size: 512 + FLOPs: 17581972224 + Parameters: 86567656 + Training Data: ImageNet-1k + 
In Collection: MoCoV3 + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.0 + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k_20220826-f1e6c442.pth + Config: configs/mocov3/benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py + - Name: vit-base-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 1024 + FLOPs: 17581972224 + Parameters: 86567656 + Training Data: ImageNet-1k + In Collection: MoCoV3 + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 76.9 + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k_20220826-83be7758.pth + Config: configs/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py + - Name: mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k + Metadata: + Epochs: 300 + Batch Size: 4096 + FLOPs: 61603111936 + Parameters: 652781568 + Training Data: ImageNet-1k + In Collection: MoCoV3 + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k-224_20220829-9b88a442.pth + Config: configs/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py + Downstream: + - vit-large-p16_mocov3-pre_8xb64-coslr-100e_in1k + - Name: vit-large-p16_mocov3-pre_8xb64-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 512 + FLOPs: 61603111936 + Parameters: 304326632 + Training Data: ImageNet-1k + In Collection: MoCoV3 + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.7 + Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k_20220829-878a2f7f.pth + Config: configs/mocov3/benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py diff --git a/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e4eabccad9017df0cb3838f423091365c30a7e12 --- /dev/null +++ b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py @@ -0,0 +1,82 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs512_mocov3.py', + '../_base_/default_runtime.py', +] + +# model settings +temperature = 1.0 +model = dict( + type='MoCoV3', + base_momentum=0.01, # 0.01 for 100e and 300e, 0.004 for 1000e + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='SyncBN'), + zero_init_residual=False), + neck=dict( + type='NonLinearNeck', + in_channels=2048, + hid_channels=4096, + out_channels=256, + num_layers=2, + with_bias=False, + with_last_bn=True, + with_last_bn_affine=False, + with_last_bias=False, + with_avg_pool=True), + head=dict( + type='MoCoV3Head', + predictor=dict( + type='NonLinearNeck', + in_channels=256, + hid_channels=4096, + out_channels=256, + num_layers=2, + with_bias=False, + with_last_bn=False, + with_last_bn_affine=False, + with_last_bias=False, + with_avg_pool=False), + loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature), + temperature=temperature)) + +# optimizer +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict(type='LARS', lr=9.6, 
weight_decay=1e-6, momentum=0.9), + paramwise_cfg=dict( + custom_keys={ + 'bn': dict(decay_mult=0, lars_exclude=True), + 'bias': dict(decay_mult=0, lars_exclude=True), + # bn layer in ResNet block downsample module + 'downsample.1': dict(decay_mult=0, lars_exclude=True), + }), +) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=90, + by_epoch=True, + begin=10, + end=100, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100) +# only keeps the latest 3 checkpoints +default_hooks = dict(checkpoint=dict(max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..cc0e4141032b0f8cbe82af08b653db9849013a36 --- /dev/null +++ b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py @@ -0,0 +1,82 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs512_mocov3.py', + '../_base_/default_runtime.py', +] + +# model settings +temperature = 1.0 +model = dict( + type='MoCoV3', + base_momentum=0.01, # 0.01 for 100e and 300e, 0.004 for 1000e + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='SyncBN'), + zero_init_residual=False), + neck=dict( + type='NonLinearNeck', + in_channels=2048, + hid_channels=4096, + out_channels=256, + num_layers=2, + with_bias=False, + with_last_bn=True, + with_last_bn_affine=False, + with_last_bias=False, + with_avg_pool=True), + head=dict( + type='MoCoV3Head', + predictor=dict( + type='NonLinearNeck', + in_channels=256, + hid_channels=4096, + out_channels=256, + num_layers=2, + with_bias=False, + with_last_bn=False, + with_last_bn_affine=False, + with_last_bias=False, + with_avg_pool=False), + loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature), + temperature=temperature)) + +# optimizer +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict(type='LARS', lr=4.8, weight_decay=1e-6, momentum=0.9), + paramwise_cfg=dict( + custom_keys={ + 'bn': dict(decay_mult=0, lars_exclude=True), + 'bias': dict(decay_mult=0, lars_exclude=True), + # bn layer in ResNet block downsample module + 'downsample.1': dict(decay_mult=0, lars_exclude=True), + }), +) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=290, + by_epoch=True, + begin=10, + end=300, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300) +# only keeps the latest 3 checkpoints +default_hooks = dict(checkpoint=dict(max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..87f18e350ca2209fd2958a867ea6bf9887c695e5 --- /dev/null +++ b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py @@ -0,0 +1,82 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs512_mocov3.py', + '../_base_/default_runtime.py', +] + +# model settings +temperature = 1.0 +model = dict( + type='MoCoV3', + base_momentum=0.004, # 0.01 for 100e and 300e, 0.004 for 800 and 1000e + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='SyncBN'), + zero_init_residual=False), + neck=dict( + type='NonLinearNeck', + in_channels=2048, + hid_channels=4096, + out_channels=256, + num_layers=2, + with_bias=False, + with_last_bn=True, + with_last_bn_affine=False, + with_last_bias=False, + with_avg_pool=True), + head=dict( + type='MoCoV3Head', + predictor=dict( + type='NonLinearNeck', + in_channels=256, + hid_channels=4096, + out_channels=256, + num_layers=2, + with_bias=False, + with_last_bn=False, + with_last_bn_affine=False, + with_last_bias=False, + with_avg_pool=False), + loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature), + temperature=temperature)) + +# optimizer +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict(type='LARS', lr=4.8, weight_decay=1.5e-6, momentum=0.9), + paramwise_cfg=dict( + custom_keys={ + 'bn': dict(decay_mult=0, lars_exclude=True), + 'bias': dict(decay_mult=0, lars_exclude=True), + # bn layer in ResNet block downsample module + 'downsample.1': dict(decay_mult=0, lars_exclude=True), + }), +) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=790, + by_epoch=True, + begin=10, + end=800, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +# only keeps the latest 3 checkpoints +default_hooks = dict(checkpoint=dict(max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py b/configs/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..6b18fda74d646fbc6c85a0c95d70f52d91712142 --- /dev/null +++ b/configs/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py @@ -0,0 +1,151 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs512_mocov3.py', + '../_base_/default_runtime.py', +] + +# dataset settings +# the difference between ResNet50 and ViT pipeline is the `scale` in +# `RandomResizedCrop`, `scale=(0.08, 1.)` in ViT pipeline +view_pipeline1 = [ + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.08, 1.), + backend='pillow'), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.2, + hue=0.1) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=1.), + dict(type='Solarize', thr=128, prob=0.), + dict(type='RandomFlip', prob=0.5), +] +view_pipeline2 = [ + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.08, 1.), + backend='pillow'), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.2, + hue=0.1) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=0.1), + dict(type='Solarize', thr=128, prob=0.2), + dict(type='RandomFlip', prob=0.5), +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='MultiView', + num_views=[1, 1], + transforms=[view_pipeline1, view_pipeline2]), + dict(type='PackInputs') +] + +train_dataloader = dict(batch_size=256, dataset=dict(pipeline=train_pipeline)) + +# model settings +temperature = 0.2 +model = dict( + type='MoCoV3', + base_momentum=0.01, + backbone=dict( + type='MoCoV3ViT', + arch='base', # embed_dim = 768 + img_size=224, + patch_size=16, + stop_grad_conv1=True), + neck=dict( + type='NonLinearNeck', + in_channels=768, + hid_channels=4096, + out_channels=256, + num_layers=3, + with_bias=False, + with_last_bn=True, + with_last_bn_affine=False, + with_last_bias=False, + with_avg_pool=False), + head=dict( + type='MoCoV3Head', + predictor=dict( + type='NonLinearNeck', + in_channels=256, + hid_channels=4096, + out_channels=256, + num_layers=2, + with_bias=False, + with_last_bn=True, + with_last_bn_affine=False, + with_last_bias=False, + with_avg_pool=False), + loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature), + temperature=temperature)) + +# optimizer +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict(type='AdamW', lr=2.4e-3, weight_decay=0.1)) +find_unused_parameters = True + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=260, + by_epoch=True, + begin=40, + end=300, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300) +# only keeps the latest 3 checkpoints +default_hooks = dict(checkpoint=dict(max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for 
automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py b/configs/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ae31c6d8c9540640591a668be09f3cc670970283 --- /dev/null +++ b/configs/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py @@ -0,0 +1,154 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs512_mocov3.py', + '../_base_/default_runtime.py', +] + +# dataset settings +# the difference between ResNet50 and ViT pipeline is the `scale` in +# `RandomResizedCrop`, `scale=(0.08, 1.)` in ViT pipeline +view_pipeline1 = [ + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.08, 1.), + backend='pillow'), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.2, + hue=0.1) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=1.), + dict(type='Solarize', thr=128, prob=0.), + dict(type='RandomFlip', prob=0.5), +] +view_pipeline2 = [ + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.08, 1.), + backend='pillow'), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.2, + hue=0.1) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=0.1), + dict(type='Solarize', thr=128, prob=0.2), + dict(type='RandomFlip', prob=0.5), +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='MultiView', + num_views=[1, 1], + transforms=[view_pipeline1, view_pipeline2]), + dict(type='PackInputs') +] + +train_dataloader = dict(batch_size=64, dataset=dict(pipeline=train_pipeline)) + +# model settings +temperature = 0.2 +model = dict( + type='MoCoV3', + base_momentum=0.01, + backbone=dict( + type='MoCoV3ViT', + arch='large', # embed_dim = 1024 + img_size=224, + patch_size=16, + stop_grad_conv1=True), + neck=dict( + type='NonLinearNeck', + in_channels=1024, + hid_channels=4096, + out_channels=256, + num_layers=3, + with_bias=False, + with_last_bn=True, + with_last_bn_affine=False, + with_last_bias=False, + with_avg_pool=False), + head=dict( + type='MoCoV3Head', + predictor=dict( + type='NonLinearNeck', + in_channels=256, + hid_channels=4096, + out_channels=256, + num_layers=2, + with_bias=False, + with_last_bn=True, + with_last_bn_affine=False, + with_last_bias=False, + with_avg_pool=False), + loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature), + temperature=temperature)) + +# optimizer +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + clip_grad=dict(max_norm=5.0, error_if_nonfinite=False), + optimizer=dict(type='AdamW', lr=2.4e-3, weight_decay=0.1)) +find_unused_parameters = True + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=260, + by_epoch=True, + begin=40, + end=300, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300) +# 
only keeps the latest 3 checkpoints +default_hooks = dict(checkpoint=dict(max_keep_ckpts=3)) + +randomness = dict(seed=0) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py b/configs/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0d26eec77d847c5f7fdb02b20bea224b43ce393d --- /dev/null +++ b/configs/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py @@ -0,0 +1,151 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs512_mocov3.py', + '../_base_/default_runtime.py', +] + +# dataset settings +# the difference between ResNet50 and ViT pipeline is the `scale` in +# `RandomResizedCrop`, `scale=(0.08, 1.)` in ViT pipeline +view_pipeline1 = [ + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.08, 1.), + backend='pillow'), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.2, + hue=0.1) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=1.), + dict(type='Solarize', thr=128, prob=0.), + dict(type='RandomFlip', prob=0.5), +] +view_pipeline2 = [ + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.08, 1.), + backend='pillow'), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.4, + contrast=0.4, + saturation=0.2, + hue=0.1) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=0.1), + dict(type='Solarize', thr=128, prob=0.2), + dict(type='RandomFlip', prob=0.5), +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='MultiView', + num_views=[1, 1], + transforms=[view_pipeline1, view_pipeline2]), + dict(type='PackInputs') +] + +train_dataloader = dict(batch_size=256, dataset=dict(pipeline=train_pipeline)) + +# model settings +temperature = 0.2 +model = dict( + type='MoCoV3', + base_momentum=0.01, + backbone=dict( + type='MoCoV3ViT', + arch='mocov3-small', # embed_dim = 384 + img_size=224, + patch_size=16, + stop_grad_conv1=True), + neck=dict( + type='NonLinearNeck', + in_channels=384, + hid_channels=4096, + out_channels=256, + num_layers=3, + with_bias=False, + with_last_bn=True, + with_last_bn_affine=False, + with_last_bias=False, + with_avg_pool=False), + head=dict( + type='MoCoV3Head', + predictor=dict( + type='NonLinearNeck', + in_channels=256, + hid_channels=4096, + out_channels=256, + num_layers=2, + with_bias=False, + with_last_bn=True, + with_last_bn_affine=False, + with_last_bias=False, + with_avg_pool=False), + loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature), + temperature=temperature)) + +# optimizer +optim_wrapper = dict( + type='AmpOptimWrapper', + loss_scale='dynamic', + optimizer=dict(type='AdamW', lr=2.4e-3, weight_decay=0.1)) +find_unused_parameters = True + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=260, + by_epoch=True, + begin=40, + end=300, + 
convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300) +# only keeps the latest 3 checkpoints +default_hooks = dict(checkpoint=dict(max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/mvit/README.md b/configs/mvit/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1bf72e5e4cbb71c8ba548d9a730b0180e47fbc37 --- /dev/null +++ b/configs/mvit/README.md @@ -0,0 +1,85 @@ +# MViT V2 + +> [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf) + + + +## Abstract + +In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video +classification, as well as object detection. We present an improved version of MViT that incorporates +decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture +in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where +it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where +it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art +performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as +well as 86.1% on Kinetics-400 video classification. + +
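To make the two MViTv2 ingredients named above more concrete, here is a deliberately simplified, single-head PyTorch sketch of pooling attention with a residual pooling connection. It omits multi-head splitting, the class token, and the decomposed relative positional embeddings; the layer names and the shared k/v pooling conv are illustrative choices, not MMPreTrain's implementation.

```python
import torch
import torch.nn as nn


class PooledAttention(nn.Module):
    """Single-head pooling attention with a residual pooling connection (toy sketch)."""

    def __init__(self, dim, stride=2):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        # depth-wise strided convs shrink the token grid before attention
        self.pool_q = nn.Conv2d(dim, dim, 3, stride=stride, padding=1, groups=dim)
        # k and v share one pooling conv here purely for brevity
        self.pool_kv = nn.Conv2d(dim, dim, 3, stride=stride, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, hw):
        # x: (B, H*W, C) patch tokens, hw: (H, W) of the token grid
        B, _, C = x.shape
        H, W = hw

        def pool(t, conv):
            t = t.transpose(1, 2).reshape(B, C, H, W)
            t = conv(t)
            return t.flatten(2).transpose(1, 2)  # (B, H'*W', C)

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = pool(q, self.pool_q), pool(k, self.pool_kv), pool(v, self.pool_kv)

        attn = (q @ k.transpose(-2, -1)) * C ** -0.5  # scaled dot-product attention
        out = attn.softmax(dim=-1) @ v
        out = out + q                                 # residual pooling connection
        return self.proj(out)


# Example: PooledAttention(96)(torch.rand(2, 14 * 14, 96), (14, 14)).shape -> (2, 49, 96)
```

In MViTv2 the strided query pooling is also what reduces spatial resolution between stages; the sketch shows it inside a single block only.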
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('mvitv2-tiny_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('mvitv2-tiny_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/mvit/mvitv2-tiny_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-tiny_3rdparty_in1k_20220722-db7beeef.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :----------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------: | :----------------------------------------------------------------------------------: | +| `mvitv2-tiny_3rdparty_in1k`\* | From scratch | 24.17 | 4.70 | 82.33 | 96.15 | [config](mvitv2-tiny_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-tiny_3rdparty_in1k_20220722-db7beeef.pth) | +| `mvitv2-small_3rdparty_in1k`\* | From scratch | 34.87 | 7.00 | 83.63 | 96.51 | [config](mvitv2-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-small_3rdparty_in1k_20220722-986bd741.pth) | +| `mvitv2-base_3rdparty_in1k`\* | From scratch | 51.47 | 10.16 | 84.34 | 96.86 | [config](mvitv2-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-base_3rdparty_in1k_20220722-9c4f0a17.pth) | +| `mvitv2-large_3rdparty_in1k`\* | From scratch | 217.99 | 43.87 | 85.25 | 97.14 | [config](mvitv2-large_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-large_3rdparty_in1k_20220722-2b57b983.pth) | + +*Models with * are converted from the [official repo](https://github.com/facebookresearch/mvit). The config files of these models are only for inference. 
We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{li2021improved,
+  title={MViTv2: Improved multiscale vision transformers for classification and detection},
+  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
+  booktitle={CVPR},
+  year={2022}
+}
+```
diff --git a/configs/mvit/metafile.yml b/configs/mvit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..c16f4f8871562637e7251eb2950bd72d3fee7df7
--- /dev/null
+++ b/configs/mvit/metafile.yml
@@ -0,0 +1,95 @@
+Collections:
+  - Name: MViT V2
+    Metadata:
+      Architecture:
+        - Attention Dropout
+        - Convolution
+        - Dense Connections
+        - GELU
+        - Layer Normalization
+        - Scaled Dot-Product Attention
+        - Attention Pooling
+    Paper:
+      URL: http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf
+      Title: 'MViTv2: Improved Multiscale Vision Transformers for Classification and Detection'
+    README: configs/mvit/README.md
+    Code:
+      URL: https://github.com/open-mmlab/mmpretrain/blob/v0.24.0/mmcls/models/backbones/mvit.py
+      Version: v0.24.0
+
+Models:
+  - Name: mvitv2-tiny_3rdparty_in1k
+    In Collection: MViT V2
+    Metadata:
+      FLOPs: 4703510768
+      Parameters: 24173320
+      Training Data:
+        - ImageNet-1k
+    Results:
+      - Dataset: ImageNet-1k
+        Task: Image Classification
+        Metrics:
+          Top 1 Accuracy: 82.33
+          Top 5 Accuracy: 96.15
+    Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-tiny_3rdparty_in1k_20220722-db7beeef.pth
+    Converted From:
+      Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_T_in1k.pyth
+      Code: https://github.com/facebookresearch/mvit
+    Config: configs/mvit/mvitv2-tiny_8xb256_in1k.py
+
+  - Name: mvitv2-small_3rdparty_in1k
+    In Collection: MViT V2
+    Metadata:
+      FLOPs: 6997555136
+      Parameters: 34870216
+      Training Data:
+        - ImageNet-1k
+    Results:
+      - Dataset: ImageNet-1k
+        Task: Image Classification
+        Metrics:
+          Top 1 Accuracy: 83.63
+          Top 5 Accuracy: 96.51
+    Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-small_3rdparty_in1k_20220722-986bd741.pth
+    Converted From:
+      Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_S_in1k.pyth
+      Code: https://github.com/facebookresearch/mvit
+    Config: configs/mvit/mvitv2-small_8xb256_in1k.py
+
+  - Name: mvitv2-base_3rdparty_in1k
+    In Collection: MViT V2
+    Metadata:
+      FLOPs: 10157964400
+      Parameters: 51472744
+      Training Data:
+        - ImageNet-1k
+    Results:
+      - Dataset: ImageNet-1k
+        Task: Image Classification
+        Metrics:
+          Top 1 Accuracy: 84.34
+          Top 5 Accuracy: 96.86
+    Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-base_3rdparty_in1k_20220722-9c4f0a17.pth
+    Converted From:
+      Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_B_in1k.pyth
+      Code: https://github.com/facebookresearch/mvit
+    Config: configs/mvit/mvitv2-base_8xb256_in1k.py
+
+  - Name: mvitv2-large_3rdparty_in1k
+    In Collection: MViT V2
+    Metadata:
+      FLOPs: 43868151412
+      Parameters: 217992952
+      Training Data:
+        - ImageNet-1k
+    Results:
+      - Dataset: ImageNet-1k
+        Task: Image Classification
+        Metrics:
+          Top 1 Accuracy: 85.25
+          Top 5 Accuracy: 97.14
+    Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-large_3rdparty_in1k_20220722-2b57b983.pth
+    Converted From:
+      Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_L_in1k.pyth
+      Code:
https://github.com/facebookresearch/mvit + Config: configs/mvit/mvitv2-large_8xb256_in1k.py diff --git a/configs/mvit/mvitv2-base_8xb256_in1k.py b/configs/mvit/mvitv2-base_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ee3ec11e2bc9873e21b58f0e3e940b5d9fc1e4d5 --- /dev/null +++ b/configs/mvit/mvitv2-base_8xb256_in1k.py @@ -0,0 +1,43 @@ +_base_ = [ + '../_base_/models/mvit/mvitv2-base.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# dataset settings +train_dataloader = dict(batch_size=256) +val_dataloader = dict(batch_size=256) +test_dataloader = dict(batch_size=256) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=2.5e-4), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.pos_embed': dict(decay_mult=0.0), + '.rel_pos_h': dict(decay_mult=0.0), + '.rel_pos_w': dict(decay_mult=0.0) + }), + clip_grad=dict(max_norm=1.0), +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=70, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70) +] + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/mvit/mvitv2-large_8xb256_in1k.py b/configs/mvit/mvitv2-large_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..eacddf96e9f9ab6b0da3f3edec973d69d41d1c9b --- /dev/null +++ b/configs/mvit/mvitv2-large_8xb256_in1k.py @@ -0,0 +1,43 @@ +_base_ = [ + '../_base_/models/mvit/mvitv2-large.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs2048_AdamW.py', + '../_base_/default_runtime.py' +] + +# dataset settings +train_dataloader = dict(batch_size=256) +val_dataloader = dict(batch_size=256) +test_dataloader = dict(batch_size=256) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=2.5e-4), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.pos_embed': dict(decay_mult=0.0), + '.rel_pos_h': dict(decay_mult=0.0), + '.rel_pos_w': dict(decay_mult=0.0) + }), + clip_grad=dict(max_norm=1.0), +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=70, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70) +] + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/mvit/mvitv2-small_8xb256_in1k.py b/configs/mvit/mvitv2-small_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..74cfd0a357a7ab773f5ac27404bbc0b78b06f901 --- /dev/null +++ b/configs/mvit/mvitv2-small_8xb256_in1k.py @@ -0,0 +1,43 @@ +_base_ = [ + '../_base_/models/mvit/mvitv2-small.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs2048_AdamW.py', + '../_base_/default_runtime.py' +] + +# dataset settings +train_dataloader = dict(batch_size=256) +val_dataloader = dict(batch_size=256) +test_dataloader = dict(batch_size=256) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=2.5e-4), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.pos_embed': dict(decay_mult=0.0), + '.rel_pos_h': dict(decay_mult=0.0), + '.rel_pos_w': dict(decay_mult=0.0) + }), + clip_grad=dict(max_norm=1.0), +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=70, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70) +] + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/mvit/mvitv2-tiny_8xb256_in1k.py b/configs/mvit/mvitv2-tiny_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4e563a2c9840fe27ae7ba4425976b540b40d21bc --- /dev/null +++ b/configs/mvit/mvitv2-tiny_8xb256_in1k.py @@ -0,0 +1,43 @@ +_base_ = [ + '../_base_/models/mvit/mvitv2-tiny.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs2048_AdamW.py', + '../_base_/default_runtime.py' +] + +# dataset settings +train_dataloader = dict(batch_size=256) +val_dataloader = dict(batch_size=256) +test_dataloader = dict(batch_size=256) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=2.5e-4), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.pos_embed': dict(decay_mult=0.0), + '.rel_pos_h': dict(decay_mult=0.0), + '.rel_pos_w': dict(decay_mult=0.0) + }), + clip_grad=dict(max_norm=1.0), +) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + end=70, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70) +] + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/ofa/README.md b/configs/ofa/README.md new file mode 100644 index 0000000000000000000000000000000000000000..22e20f8bd85d41ed7faa1794273aeec002311f17 --- /dev/null +++ b/configs/ofa/README.md @@ -0,0 +1,88 @@ +# OFA + +> [OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework](https://arxiv.org/abs/2202.03052) + + + +## Abstract + +In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. 
OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows the instruction-based learning in both pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with the recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performances on uni-modal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. + +
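
Because every task above is cast as the same sequence-to-sequence generation problem, the different OFA checkpoints in this folder are all driven through the one `inference_model` helper, differing only in their task-specific extra inputs. The snippet below sketches this for the caption and zero-shot VQA checkpoints listed later in this README; the caption call mirrors the usage section below, while passing the question as an extra positional argument is an assumption about the underlying VQA inferencer rather than verified usage.

```python
from mmpretrain import inference_model

# Image captioning: only the image is needed.
caption = inference_model('ofa-base_3rdparty-finetuned_caption', 'demo/cat-dog.png')
print(caption)

# Zero-shot VQA: the question is assumed to be passed as a second
# positional argument to the task inferencer.
answer = inference_model('ofa-base_3rdparty-zeroshot_vqa', 'demo/cat-dog.png',
                         'What animals are in the picture?')
print(answer)
```
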
+ +
+ +## How to use it? + + + +**Use the model** + +```python +from mmpretrain import inference_model + +result = inference_model('ofa-base_3rdparty-finetuned_caption', 'demo/cat-dog.png') +print(result) +# {'pred_caption': 'a dog and a kitten sitting next to each other'} +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/ofa/ofa-base_finetuned_refcoco.py https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_refcoco_20230418-2797d3ab.pth +``` + + + +## Models and results + +### Image Caption on COCO + +| Model | Params (M) | BLEU-4 | CIDER | Config | Download | +| :-------------------------------------- | :--------: | :----: | :----: | :-------------------------------------: | :--------------------------------------------------------------------------------------------------: | +| `ofa-base_3rdparty-finetuned_caption`\* | 182.24 | 42.64 | 144.50 | [config](ofa-base_finetuned_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_coco-caption_20230418-de18914e.pth) | + +*Models with * are converted from the [official repo](https://github.com/OFA-Sys/OFA). The config files of these models are only for inference. We haven't reproduce the training results.* + +### Visual Grounding on RefCOCO + +| Model | Params (M) | Accuracy (testA) | Accuracy (testB) | Config | Download | +| :-------------------------------------- | :--------: | :--------------: | :--------------: | :-------------------------------------: | :------------------------------------------------------------------------------: | +| `ofa-base_3rdparty-finetuned_refcoco`\* | 182.24 | 90.49 | 83.63 | [config](ofa-base_finetuned_refcoco.py) | [model](https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_refcoco_20230418-2797d3ab.pth) | + +*Models with * are converted from the [official repo](https://github.com/OFA-Sys/OFA). The config files of these models are only for inference. We haven't reproduce the training results.* + +### Visual Question Answering on VQAv2 + +| Model | Params (M) | Accuracy | Config | Download | +| :---------------------------------- | :--------: | :------: | :---------------------------------: | :--------------------------------------------------------------------------------------------------------------: | +| `ofa-base_3rdparty-finetuned_vqa`\* | 182.24 | 78.00 | [config](ofa-base_finetuned_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_coco-vqa_20230418-f38539a5.pth) | +| `ofa-base_3rdparty-zeroshot_vqa`\* | 182.24 | 58.32 | [config](ofa-base_zeroshot_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_pretrain_20230418-dccfc07f.pth) | + +*Models with * are converted from the [official repo](https://github.com/OFA-Sys/OFA). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{wang2022ofa, + author = {Peng Wang and + An Yang and + Rui Men and + Junyang Lin and + Shuai Bai and + Zhikang Li and + Jianxin Ma and + Chang Zhou and + Jingren Zhou and + Hongxia Yang}, + title = {OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence + Learning Framework}, + journal = {CoRR}, + volume = {abs/2202.03052}, + year = {2022} +} +``` diff --git a/configs/ofa/metafile.yml b/configs/ofa/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..9c4b3ebf72b766ae64b89bc22ab60c616159af1d --- /dev/null +++ b/configs/ofa/metafile.yml @@ -0,0 +1,89 @@ +Collections: + - Name: OFA + Metadata: + Architecture: + - ResNet + - Transformer + Training Data: + - CC12M + - CC3M + - SBU + - COCO + - VG + - VQAv2 + - GQA + - RefCOCO + - OpenImages + - Object365 + - YFCC100M + - ImageNet-21K + - Pile + Paper: + Title: 'OFA: Unifying Architectures, Tasks, and Modalities Through a Simple + Sequence-to-Sequence Learning Framework' + URL: https://arxiv.org/abs/2202.03052 + README: configs/ofa/README.md + +Models: + - Name: ofa-base_3rdparty-finetuned_refcoco + Metadata: + FLOPs: null + Parameters: 182238536 + In Collection: OFA + Results: + - Task: Visual Grounding + Dataset: RefCOCO + Metrics: + Accuracy (testA): 90.49 + Accuracy (testB): 83.63 + Weights: https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_refcoco_20230418-2797d3ab.pth + Config: configs/ofa/ofa-base_finetuned_refcoco.py + Converted From: + Weights: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcoco_base_best.pt + Code: https://github.com/OFA-Sys/OFA + - Name: ofa-base_3rdparty-finetuned_vqa + Metadata: + FLOPs: null + Parameters: 182238536 + In Collection: OFA + Results: + - Task: Visual Question Answering + Dataset: VQAv2 + Metrics: + Accuracy: 78.00 # Report from the official repo + Weights: https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_coco-vqa_20230418-f38539a5.pth + Config: configs/ofa/ofa-base_finetuned_vqa.py + Converted From: + Weights: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/vqa_large_best.pt + Code: https://github.com/OFA-Sys/OFA + - Name: ofa-base_3rdparty-finetuned_caption + Metadata: + FLOPs: null + Parameters: 182238536 + In Collection: OFA + Results: + - Task: Image Caption + Dataset: COCO + Metrics: + BLEU-4: 42.64 + CIDER: 144.50 + Weights: https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_coco-caption_20230418-de18914e.pth + Config: configs/ofa/ofa-base_finetuned_caption.py + Converted From: + Weights: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_base_best.pt + Code: https://github.com/OFA-Sys/OFA + - Name: ofa-base_3rdparty-zeroshot_vqa + Metadata: + FLOPs: null + Parameters: 182238536 + In Collection: OFA + Results: + - Task: Visual Question Answering + Dataset: VQAv2 + Metrics: + Accuracy: 58.32 + Weights: https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_pretrain_20230418-dccfc07f.pth + Config: configs/ofa/ofa-base_zeroshot_vqa.py + Converted From: + Weights: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_base.pt + Code: https://github.com/OFA-Sys/OFA diff --git a/configs/ofa/ofa-base_finetuned_caption.py b/configs/ofa/ofa-base_finetuned_caption.py new file mode 100644 index 0000000000000000000000000000000000000000..45efff06ec8ebd5ecc85dbdf15834819fb07bb38 --- /dev/null +++ 
b/configs/ofa/ofa-base_finetuned_caption.py @@ -0,0 +1,41 @@ +_base_ = [ + '../_base_/datasets/coco_caption.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='OFA', + task='caption', + vocab_size=59457, + embedding_dim=768, + encoder_cfg=dict( + embed_images=dict(type='OFAResNet', depth=101), + num_layers=6, + ), + decoder_cfg=dict(num_layers=6), + generation_cfg=dict(use_cache=True), + tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-base'), +) + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=(480, 480)), + dict(type='PackInputs', meta_keys=('image_id', )), +] + +train_dataloader = None +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +train_cfg = None +val_cfg = dict() +test_cfg = dict() diff --git a/configs/ofa/ofa-base_finetuned_refcoco.py b/configs/ofa/ofa-base_finetuned_refcoco.py new file mode 100644 index 0000000000000000000000000000000000000000..5a7435dbd467ed71b3ee6a4e2c6020083c180729 --- /dev/null +++ b/configs/ofa/ofa-base_finetuned_refcoco.py @@ -0,0 +1,45 @@ +_base_ = [ + '../_base_/datasets/refcoco.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='OFA', + task='refcoco', + vocab_size=59457, + embedding_dim=768, + encoder_cfg=dict( + embed_images=dict(type='OFAResNet', depth=101), + num_layers=6, + ), + decoder_cfg=dict(num_layers=6), + generation_cfg=dict(use_cache=True), + tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-base'), +) + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=(512, 512)), + dict( + type='PackInputs', + algorithm_keys=['text', 'gt_bboxes'], + meta_keys=['image_id', 'scale_factor'], + ), +] + +train_dataloader = None +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +train_cfg = None +val_cfg = dict() +test_cfg = dict() diff --git a/configs/ofa/ofa-base_finetuned_vqa.py b/configs/ofa/ofa-base_finetuned_vqa.py new file mode 100644 index 0000000000000000000000000000000000000000..b120d091e5b9d1b38a3e0ebd1466f0fed9d0f611 --- /dev/null +++ b/configs/ofa/ofa-base_finetuned_vqa.py @@ -0,0 +1,64 @@ +_base_ = [ + '../_base_/datasets/coco_vqa.py', + '../_base_/default_runtime.py', +] + +ANS2LABEL = 'https://ofa-beijing.oss-cn-beijing.aliyuncs.com/datasets/vqa_data/trainval_ans2label.pkl' # noqa: E501 + +# model settings +model = dict( + type='OFA', + task='vqa', + vocab_size=59457, + embedding_dim=768, + ans2label=ANS2LABEL, + encoder_cfg=dict( + embed_images=dict(type='OFAResNet', depth=101), + num_layers=6, + num_heads=12, + ), + decoder_cfg=dict( + num_layers=6, + num_heads=12, + ), + generation_cfg=dict( + num_beams=5, + max_new_tokens=200, + length_penalty=0., # VQA doesn't require longer answer. 
+ use_cache=True, + ), + tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-base'), +) + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(480, 480), + interpolation='bicubic', + backend='pillow'), + dict(type='OFAAddObjects'), + dict( + type='PackInputs', + algorithm_keys=[ + 'question', 'gt_answer', 'gt_answer_weight', 'decoder_prompt' + ], + meta_keys=['question_id', 'image_id'], + ), +] + +train_dataloader = None # Eval only +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +train_cfg = None +val_cfg = dict() +test_cfg = dict() diff --git a/configs/ofa/ofa-base_zeroshot_vqa.py b/configs/ofa/ofa-base_zeroshot_vqa.py new file mode 100644 index 0000000000000000000000000000000000000000..9890cdd2a48484102877e3f3a946b73fefa6dbae --- /dev/null +++ b/configs/ofa/ofa-base_zeroshot_vqa.py @@ -0,0 +1,42 @@ +_base_ = [ + '../_base_/datasets/coco_vqa.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='OFA', + task='vqa', + vocab_size=59457, + embedding_dim=768, + encoder_cfg=dict( + embed_images=dict(type='OFAResNet', depth=101), + num_layers=6, + num_heads=12, + ), + decoder_cfg=dict( + num_layers=6, + num_heads=12, + ), + generation_cfg=dict( + num_beams=20, + max_new_tokens=200, + length_penalty=0., # VQA doesn't require longer answer. + use_cache=True, + ), + tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-base'), +) + +# data settings +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + to_rgb=True, +) + +train_dataloader = None # Eval only + +# schedule settings +train_cfg = None +val_cfg = dict() +test_cfg = dict() diff --git a/configs/ofa/ofa-large_zeroshot_vqa.py b/configs/ofa/ofa-large_zeroshot_vqa.py new file mode 100644 index 0000000000000000000000000000000000000000..8b47121127c21baabbb963ccc8407a27d823cec1 --- /dev/null +++ b/configs/ofa/ofa-large_zeroshot_vqa.py @@ -0,0 +1,43 @@ +_base_ = [ + '../_base_/datasets/coco_vqa.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='OFA', + task='vqa', + vocab_size=59457, + embedding_dim=1024, + encoder_cfg=dict( + embed_images=dict(type='OFAResNet', depth=152), + num_layers=12, + num_heads=16, + ), + decoder_cfg=dict( + num_layers=12, + num_heads=16, + ), + generation_cfg=dict( + num_beams=20, + max_new_tokens=200, + length_penalty=0., # VQA doesn't require longer answer. 
+ use_cache=True, + ), + tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-large'), +) + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + to_rgb=True, +) + +train_dataloader = None # Eval only + +# schedule settings +train_cfg = None +val_cfg = dict() +test_cfg = dict() diff --git a/configs/otter/README.md b/configs/otter/README.md new file mode 100644 index 0000000000000000000000000000000000000000..18a84684f84e61a664c0742ff96ecaa440f2633b --- /dev/null +++ b/configs/otter/README.md @@ -0,0 +1,79 @@ +# Otter + +> [Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://arxiv.org/abs/2305.03726) + + + +## Abstract + +Large language models (LLMs) have demonstrated significant universal capabilities as few/zero-shot learners in various tasks due to their pre-training on vast amounts of text data, as exemplified by GPT-3, which boosted to InstrctGPT and ChatGPT, effectively following natural language instructions to accomplish real-world tasks. In this paper, we propose to introduce instruction tuning into multi-modal models, motivated by the Flamingo model's upstream interleaved format pretraining dataset. We adopt a similar approach to construct our MultI-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. We then introduce Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following ability and in-context learning. We also optimize OpenFlamingo's implementation for researchers, democratizing the required training resources from 1$\times$ A100 GPU to 4$\times$ RTX-3090 GPUs, and integrate both OpenFlamingo and Otter into Huggingface Transformers for more researchers to incorporate the models into their customized training and inference pipelines. + +
+ +
+ +## How to use it? + + + +**Use the model** + +```python +import torch +from mmpretrain import get_model, inference_model + +model = get_model('otter-9b_3rdparty_caption', pretrained=True, device='cuda', generation_cfg=dict(max_new_tokens=50)) +out = inference_model(model, 'demo/cat-dog.png') +print(out) +# {'pred_caption': 'The image features two adorable small puppies sitting next to each other on the grass. One puppy is brown and white, while the other is tan and white. They appear to be relaxing outdoors, enjoying each other'} +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/otter/otter-9b_caption.py https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth +``` + + + +## Models and results + +### Image Caption on COCO + +| Model | Params (M) | BLEU-4 | CIDER | Config | Download | +| :---------------------------- | :--------: | :------: | :------: | :---------------------------: | :------------------------------------------------------------------------------------------------------: | +| `otter-9b_3rdparty_caption`\* | 8220.45 | Upcoming | Upcoming | [config](otter-9b_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth) | + +*Models with * are converted from the [official repo](https://github.com/Luodian/Otter/tree/main). The config files of these models are only for inference. We haven't reproduce the training results.* + +### Visual Question Answering on VQAv2 + +| Model | Params (M) | Accuracy | Config | Download | +| :------------------------ | :--------: | :------: | :-----------------------: | :------------------------------------------------------------------------------------------------------: | +| `otter-9b_3rdparty_vqa`\* | 8220.45 | Upcoming | [config](otter-9b_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth) | + +*Models with * are converted from the [official repo](https://github.com/Luodian/Otter/tree/main). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{li2023otter, + title={Otter: A Multi-Modal Model with In-Context Instruction Tuning}, + author={Li, Bo and Zhang, Yuanhan and Chen, Liangyu and Wang, Jinghao and Yang, Jingkang and Liu, Ziwei}, + journal={arXiv preprint arXiv:2305.03726}, + year={2023} +} + +@article{li2023mimicit, + title={MIMIC-IT: Multi-Modal In-Context Instruction Tuning}, + author={Bo Li and Yuanhan Zhang and Liangyu Chen and Jinghao Wang and Fanyi Pu and Jingkang Yang and Chunyuan Li and Ziwei Liu}, + year={2023}, + eprint={2306.05425}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/configs/otter/metafile.yml b/configs/otter/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..6ee89c62a4d073b5eada03e8f9fbb3508041b8d5 --- /dev/null +++ b/configs/otter/metafile.yml @@ -0,0 +1,43 @@ +Collections: + - Name: Otter + Metadata: + Architecture: + - Transformer + - Gated Cross-Attention Dense + Paper: + Title: 'Otter: A Multi-Modal Model with In-Context Instruction Tuning' + URL: https://arxiv.org/abs/2305.03726 + README: configs/otter/README.md + +Models: + - Name: otter-9b_3rdparty_caption + Metadata: + FLOPs: null + Parameters: 8220452880 + In Collection: Otter + Results: + - Task: Image Caption + Dataset: COCO + Metrics: + BLEU-4: null + CIDER: null + Weights: https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth + Config: configs/otter/otter-9b_caption.py + Converted From: + Weights: https://huggingface.co/luodian/otter-9b-hf + Code: https://github.com/Luodian/Otter/tree/main + - Name: otter-9b_3rdparty_vqa + Metadata: + FLOPs: null + Parameters: 8220452880 + In Collection: Otter + Results: + - Task: Visual Question Answering + Dataset: VQAv2 + Metrics: + Accuracy: null + Weights: https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth + Config: configs/otter/otter-9b_vqa.py + Converted From: + Weights: https://huggingface.co/luodian/otter-9b-hf + Code: https://github.com/Luodian/Otter/tree/main diff --git a/configs/otter/otter-9b_caption.py b/configs/otter/otter-9b_caption.py new file mode 100644 index 0000000000000000000000000000000000000000..e35e92ef40cabcccd35f17dd661199b04a76dd6b --- /dev/null +++ b/configs/otter/otter-9b_caption.py @@ -0,0 +1,87 @@ +_base_ = [ + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='Otter', + tokenizer=dict(type='LlamaTokenizer', name_or_path='huggyllama/llama-7b'), + vision_encoder=dict( + type='VisionTransformer', + arch='l', + patch_size=14, + pre_norm=True, + norm_cfg=dict(type='LN', eps=1e-5), + layer_cfgs=dict(act_cfg=dict(type='mmpretrain.QuickGELU')), + final_norm=False, + out_type='raw', + pretrained=( + 'https://download.openmmlab.com/mmclassification/v0/clip/' + 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'), + ), + lang_encoder=dict( + base=dict( + type='AutoModelForCausalLM', + name_or_path='huggyllama/llama-7b', + local_files_only=True), + adapter=dict( + type='FlamingoLMAdapter', + vis_hidden_size=1024, + cross_attn_every_n_layers=4, + use_media_placement_augmentation=False, + only_attend_previous=True, + ), + ), + task='caption', + final_prompt_tmpl='User:Please describe the image. 
GPT:', + generation_cfg=dict( + num_beams=3, max_new_tokens=24, no_repeat_ngram_size=3), +) + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=224, + interpolation='bicubic', + backend='pillow'), + dict(type='CenterCrop', crop_size=(224, 224)), + dict( + type='PackInputs', + algorithm_keys=['gt_caption'], + meta_keys=['image_id'], + ), +] + +val_dataloader = dict( + batch_size=8, + num_workers=8, + dataset=dict( + type='COCOCaption', + data_root='data/coco', + ann_file='annotations/coco_karpathy_val.json', + pipeline=test_pipeline, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) + +val_evaluator = dict( + type='COCOCaption', + ann_file='data/coco/annotations/coco_karpathy_val_gt.json') + +# If you want standard test, please manually configure the test dataset +test_dataloader = val_dataloader +test_evaluator = val_evaluator + +# schedule settings +val_cfg = dict() +test_cfg = dict() diff --git a/configs/otter/otter-9b_vqa.py b/configs/otter/otter-9b_vqa.py new file mode 100644 index 0000000000000000000000000000000000000000..72f2b64281126cbf71a81929b12318b0a00f9e36 --- /dev/null +++ b/configs/otter/otter-9b_vqa.py @@ -0,0 +1,104 @@ +_base_ = [ + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='Otter', + tokenizer=dict(type='LlamaTokenizer', name_or_path='huggyllama/llama-7b'), + vision_encoder=dict( + type='VisionTransformer', + arch='l', + patch_size=14, + pre_norm=True, + norm_cfg=dict(type='LN', eps=1e-5), + layer_cfgs=dict(act_cfg=dict(type='QuickGELU')), + final_norm=False, + out_type='raw', + pretrained=( + 'https://download.openmmlab.com/mmclassification/v0/clip/' + 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'), + ), + lang_encoder=dict( + base=dict( + type='AutoModelForCausalLM', + name_or_path='huggyllama/llama-7b', + local_files_only=True), + adapter=dict( + type='FlamingoLMAdapter', + vis_hidden_size=1024, + cross_attn_every_n_layers=4, + use_media_placement_augmentation=False, + only_attend_previous=True, + ), + ), + task='vqa', + final_prompt_tmpl='User:{question} GPT:', + generation_cfg=dict( + num_beams=3, max_new_tokens=24, no_repeat_ngram_size=3), +) + +# data settings +data_preprocessor = dict( + type='MultiModalDataPreprocessor', + mean=[122.770938, 116.7460125, 104.09373615], + std=[68.5005327, 66.6321579, 70.32316305], + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=224, + interpolation='bicubic', + backend='pillow'), + dict(type='CenterCrop', crop_size=(224, 224)), + dict( + type='PackInputs', + algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'], + meta_keys=['image_id'], + ), +] + +val_dataloader = dict( + batch_size=8, + num_workers=8, + dataset=dict( + type='FlamingoEvalCOCOVQA', + data_root='data/coco', + data_prefix='val2014', + question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json', + ann_file='annotations/v2_mscoco_val2014_annotations.json', + pipeline=test_pipeline, + num_shots=0, + num_support_examples=2048, + num_query_examples=5000, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +val_evaluator = dict(type='VQAAcc') + +test_dataloader = dict( + batch_size=8, + num_workers=8, + dataset=dict( + 
type='FlamingoEvalCOCOVQA', + data_root='data/coco', + data_prefix='test2015', + question_file= + 'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json', + pipeline=test_pipeline, + num_shots=0, + num_support_examples=2048, + num_query_examples=5000, + ), + sampler=dict(type='DefaultSampler', shuffle=False), + persistent_workers=True, +) +test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json') + +# schedule settings +val_cfg = dict() +test_cfg = dict() diff --git a/configs/poolformer/README.md b/configs/poolformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2c4b249329ea03662f768aa350a08fb8eebc763b --- /dev/null +++ b/configs/poolformer/README.md @@ -0,0 +1,80 @@ +# PoolFormer + +> [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) + + + +## Abstract + +Transformers have shown great potential in computer vision tasks. A common belief is their attention-based token mixer module contributes most to their competence. However, recent works show the attention-based module in transformers can be replaced by spatial MLPs and the resulted models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model, termed as PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing well-tuned vision transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 49%/61% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and urges us to initiate the concept of "MetaFormer", a general architecture abstracted from transformers without specifying the token mixer. Based on the extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design. + +
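
The pooling token mixer that replaces self-attention is only a few lines of code. The snippet below is a minimal PyTorch sketch of the idea (average pooling with the identity subtracted, following the paper's formulation); it illustrates the concept only and is not the `PoolFormer` backbone implementation in this repository.

```python
import torch
import torch.nn as nn


class PoolingTokenMixer(nn.Module):
    """Mix tokens with average pooling instead of self-attention.

    Subtracting the input keeps only the "mixing" part, because the
    residual connection around the token mixer adds the input back.
    """

    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size // 2, count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x is an (N, C, H, W) feature map; the output keeps the same shape.
        return self.pool(x) - x


x = torch.rand(1, 64, 56, 56)
print(PoolingTokenMixer()(x).shape)  # torch.Size([1, 64, 56, 56])
```
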
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('poolformer-s12_3rdparty_32xb128_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('poolformer-s12_3rdparty_32xb128_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/poolformer/poolformer-s12_32xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s12_3rdparty_32xb128_in1k_20220414-f8d83051.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :--------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :---------------------------------------------------------------------: | +| `poolformer-s12_3rdparty_32xb128_in1k`\* | From scratch | 11.92 | 1.87 | 77.24 | 93.51 | [config](poolformer-s12_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s12_3rdparty_32xb128_in1k_20220414-f8d83051.pth) | +| `poolformer-s24_3rdparty_32xb128_in1k`\* | From scratch | 21.39 | 3.51 | 80.33 | 95.05 | [config](poolformer-s24_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s24_3rdparty_32xb128_in1k_20220414-d7055904.pth) | +| `poolformer-s36_3rdparty_32xb128_in1k`\* | From scratch | 30.86 | 5.15 | 81.43 | 95.45 | [config](poolformer-s36_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s36_3rdparty_32xb128_in1k_20220414-d78ff3e8.pth) | +| `poolformer-m36_3rdparty_32xb128_in1k`\* | From scratch | 56.17 | 8.96 | 82.14 | 95.71 | [config](poolformer-m36_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m36_3rdparty_32xb128_in1k_20220414-c55e0949.pth) | +| `poolformer-m48_3rdparty_32xb128_in1k`\* | From scratch | 73.47 | 11.80 | 82.51 | 95.95 | [config](poolformer-m48_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m48_3rdparty_32xb128_in1k_20220414-9378f3eb.pth) | + +*Models with * are converted from the [official repo](https://github.com/sail-sg/poolformer). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@inproceedings{yu2022metaformer, + title={Metaformer is actually what you need for vision}, + author={Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng}, + booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition}, + pages={10819--10829}, + year={2022} +} +``` diff --git a/configs/poolformer/metafile.yml b/configs/poolformer/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..55285ddd0230270030f25bef09b1461dc7278dc3 --- /dev/null +++ b/configs/poolformer/metafile.yml @@ -0,0 +1,99 @@ +Collections: + - Name: PoolFormer + Metadata: + Training Data: ImageNet-1k + Architecture: + - Pooling + - 1x1 Convolution + - LayerScale + Paper: + URL: https://arxiv.org/abs/2111.11418 + Title: MetaFormer is Actually What You Need for Vision + README: configs/poolformer/README.md + Code: + Version: v0.22.1 + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.22.1/mmcls/models/backbones/poolformer.py + +Models: + - Name: poolformer-s12_3rdparty_32xb128_in1k + Metadata: + FLOPs: 1871399424 + Parameters: 11915176 + In Collection: PoolFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.24 + Top 5 Accuracy: 93.51 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s12_3rdparty_32xb128_in1k_20220414-f8d83051.pth + Config: configs/poolformer/poolformer-s12_32xb128_in1k.py + Converted From: + Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s12.pth.tar + Code: https://github.com/sail-sg/poolformer + - Name: poolformer-s24_3rdparty_32xb128_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 3510411008 + Parameters: 21388968 + In Collection: PoolFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.33 + Top 5 Accuracy: 95.05 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s24_3rdparty_32xb128_in1k_20220414-d7055904.pth + Config: configs/poolformer/poolformer-s24_32xb128_in1k.py + Converted From: + Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s24.pth.tar + Code: https://github.com/sail-sg/poolformer + - Name: poolformer-s36_3rdparty_32xb128_in1k + Metadata: + FLOPs: 5149422592 + Parameters: 30862760 + In Collection: PoolFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.43 + Top 5 Accuracy: 95.45 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s36_3rdparty_32xb128_in1k_20220414-d78ff3e8.pth + Config: configs/poolformer/poolformer-s36_32xb128_in1k.py + Converted From: + Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s36.pth.tar + Code: https://github.com/sail-sg/poolformer + - Name: poolformer-m36_3rdparty_32xb128_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 8960175744 + Parameters: 56172520 + In Collection: PoolFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.14 + Top 5 Accuracy: 95.71 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m36_3rdparty_32xb128_in1k_20220414-c55e0949.pth + Config: configs/poolformer/poolformer-m36_32xb128_in1k.py + Converted From: + Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_m36.pth.tar 
+ Code: https://github.com/sail-sg/poolformer + - Name: poolformer-m48_3rdparty_32xb128_in1k + Metadata: + FLOPs: 11801805696 + Parameters: 73473448 + In Collection: PoolFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.51 + Top 5 Accuracy: 95.95 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m48_3rdparty_32xb128_in1k_20220414-9378f3eb.pth + Config: configs/poolformer/poolformer-m48_32xb128_in1k.py + Converted From: + Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_m48.pth.tar + Code: https://github.com/sail-sg/poolformer diff --git a/configs/poolformer/poolformer-m36_32xb128_in1k.py b/configs/poolformer/poolformer-m36_32xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..13065b8cf5100b4d16696d54cfa8c0a727541831 --- /dev/null +++ b/configs/poolformer/poolformer-m36_32xb128_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/poolformer/poolformer_m36.py', + '../_base_/datasets/imagenet_bs128_poolformer_medium_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/poolformer/poolformer-m48_32xb128_in1k.py b/configs/poolformer/poolformer-m48_32xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2078df39c4a16783b8f1a7ffc5c5da2b346eb1f0 --- /dev/null +++ b/configs/poolformer/poolformer-m48_32xb128_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/poolformer/poolformer_m48.py', + '../_base_/datasets/imagenet_bs128_poolformer_medium_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/poolformer/poolformer-s12_32xb128_in1k.py b/configs/poolformer/poolformer-s12_32xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..7cf4a6365604def73f2ea293b857ebdc8b2ed9b3 --- /dev/null +++ b/configs/poolformer/poolformer-s12_32xb128_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/poolformer/poolformer_s12.py', + '../_base_/datasets/imagenet_bs128_poolformer_small_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/poolformer/poolformer-s24_32xb128_in1k.py b/configs/poolformer/poolformer-s24_32xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ffb2482d16c3e432c1f3d0a233a69a76b99efdd8 --- /dev/null +++ b/configs/poolformer/poolformer-s24_32xb128_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/poolformer/poolformer_s24.py', + '../_base_/datasets/imagenet_bs128_poolformer_small_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/poolformer/poolformer-s36_32xb128_in1k.py b/configs/poolformer/poolformer-s36_32xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..842dab3ac51645046d15f04b8bc1ace42781144b --- /dev/null +++ b/configs/poolformer/poolformer-s36_32xb128_in1k.py @@ -0,0 +1,17 @@ +_base_ = [ + '../_base_/models/poolformer/poolformer_s36.py', + '../_base_/datasets/imagenet_bs128_poolformer_small_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/regnet/README.md b/configs/regnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..63031f4e89b934d823ce53f08cdbad597729fd7e --- /dev/null +++ b/configs/regnet/README.md @@ -0,0 +1,88 @@ +# RegNet + +> [Designing Network Design Spaces](https://arxiv.org/abs/2003.13678) + + + +## Abstract + +In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs. + +
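
The "quantized linear function" mentioned in the abstract is the heart of the RegNet parametrization: block widths first follow a linear rule `u_j = w_0 + w_a * j` and are then snapped to a geometric progression `w_0 * w_m ** s_j`, rounded to multiples of 8. The snippet below sketches that rule with the paper's notation; the example parameter values are placeholders for illustration, not the settings behind any config in this folder.

```python
import numpy as np


def regnet_widths(w_0: float, w_a: float, w_m: float, depth: int):
    """Per-block widths from the quantized linear rule of the RegNet design space."""
    u = w_0 + w_a * np.arange(depth)             # linear widths u_j
    s = np.round(np.log(u / w_0) / np.log(w_m))  # quantization exponents s_j
    widths = w_0 * np.power(w_m, s)              # geometric widths
    return (np.round(widths / 8) * 8).astype(int).tolist()  # multiples of 8


# Placeholder parameters, only to show the shape of the output.
print(regnet_widths(w_0=24, w_a=36.0, w_m=2.5, depth=16))
```
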
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('regnetx-400mf_8xb128_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('regnetx-400mf_8xb128_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/regnet/regnetx-400mf_8xb128_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/regnet/regnetx-400mf_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-400mf_8xb128_in1k_20211213-89bfc226.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :-------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------: | :------------------------------------------------------------------------------------: | +| `regnetx-400mf_8xb128_in1k` | From scratch | 5.16 | 0.41 | 72.56 | 90.78 | [config](regnetx-400mf_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-400mf_8xb128_in1k_20211213-89bfc226.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-400mf_8xb128_in1k_20211208_143316.json) | +| `regnetx-800mf_8xb128_in1k` | From scratch | 7.26 | 0.81 | 74.76 | 92.32 | [config](regnetx-800mf_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-800mf_8xb128_in1k_20211213-222b0f11.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-800mf_8xb128_in1k_20211207_143037.log.json) | +| `regnetx-1.6gf_8xb128_in1k` | From scratch | 9.19 | 1.63 | 76.84 | 93.31 | [config](regnetx-1.6gf_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-1.6gf_8xb128_in1k_20211213-d1b89758.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-1.6gf_8xb128_in1k_20211208_143018.log.json) | +| `regnetx-3.2gf_8xb64_in1k` | From scratch | 3.21 | 1.53 | 78.09 | 94.08 | [config](regnetx-3.2gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-3.2gf_8xb64_in1k_20211213-1fdd82ae.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-3.2gf_8xb64_in1k_20211208_142720.log.json) | +| `regnetx-4.0gf_8xb64_in1k` | From scratch | 22.12 | 4.00 | 78.60 | 94.17 | [config](regnetx-4.0gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-4.0gf_8xb64_in1k_20211213-efed675c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-4.0gf_8xb64_in1k_20211207_150431.log.json) | +| `regnetx-6.4gf_8xb64_in1k` | From scratch | 26.21 | 6.51 | 79.38 | 94.65 | [config](regnetx-6.4gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-6.4gf_8xb64_in1k_20211215-5c6089da.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-6.4gf_8xb64_in1k_20211213_172748.log.json) | +| `regnetx-8.0gf_8xb64_in1k` | 
From scratch | 39.57 | 8.03 | 79.12 | 94.51 | [config](regnetx-8.0gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-8.0gf_8xb64_in1k_20211213-9a9fcc76.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-8.0gf_8xb64_in1k_20211208_103250.log.json) | +| `regnetx-12gf_8xb64_in1k` | From scratch | 46.11 | 12.15 | 79.67 | 95.03 | [config](regnetx-12gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-12gf_8xb64_in1k_20211213-5df8c2f8.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-12gf_8xb64_in1k_20211208_143713.log.json) | + +## Citation + +```bibtex +@article{radosavovic2020designing, + title={Designing Network Design Spaces}, + author={Ilija Radosavovic and Raj Prateek Kosaraju and Ross Girshick and Kaiming He and Piotr Dollár}, + year={2020}, + eprint={2003.13678}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/configs/regnet/metafile.yml b/configs/regnet/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..4796a9f42a19092e956b3511467b84b26e372b99 --- /dev/null +++ b/configs/regnet/metafile.yml @@ -0,0 +1,122 @@ +Collections: + - Name: RegNet + Metadata: + Training Data: ImageNet-1k + Architecture: + - Neural Architecture Search + - Design Space Design + - Precise BN + - SGD with nesterov + Paper: + URL: https://arxiv.org/abs/2003.13678 + Title: Designing Network Design Spaces + README: configs/regnet/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.18.0/mmcls/models/backbones/regnet.py + Version: v0.18.0 + +Models: + - Name: regnetx-400mf_8xb128_in1k + In Collection: RegNet + Config: configs/regnet/regnetx-400mf_8xb128_in1k.py + Metadata: + FLOPs: 410000000 # 0.41G + Parameters: 5160000 # 5.16M + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 72.56 + Top 5 Accuracy: 90.78 + Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-400mf_8xb128_in1k_20211213-89bfc226.pth + - Name: regnetx-800mf_8xb128_in1k + In Collection: RegNet + Config: configs/regnet/regnetx-800mf_8xb128_in1k.py + Metadata: + FLOPs: 810000000 # 0.81G + Parameters: 7260000 # 7.26M + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 74.76 + Top 5 Accuracy: 92.32 + Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-800mf_8xb128_in1k_20211213-222b0f11.pth + - Name: regnetx-1.6gf_8xb128_in1k + In Collection: RegNet + Config: configs/regnet/regnetx-1.6gf_8xb128_in1k.py + Metadata: + FLOPs: 1630000000 # 1.63G + Parameters: 9190000 # 9.19M + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 76.84 + Top 5 Accuracy: 93.31 + Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-1.6gf_8xb128_in1k_20211213-d1b89758.pth + - Name: regnetx-3.2gf_8xb64_in1k + In Collection: RegNet + Config: configs/regnet/regnetx-3.2gf_8xb64_in1k.py + Metadata: + FLOPs: 1530000000 # 1.53G + Parameters: 3210000 # 32.1M + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 78.09 + Top 5 Accuracy: 94.08 + Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-3.2gf_8xb64_in1k_20211213-1fdd82ae.pth + - Name: regnetx-4.0gf_8xb64_in1k + In Collection: RegNet + Config: configs/regnet/regnetx-4.0gf_8xb64_in1k.py + Metadata: + FLOPs: 4000000000 # 4G + Parameters: 22120000 # 22.12M + Results: + - 
Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 78.60 + Top 5 Accuracy: 94.17 + Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-4.0gf_8xb64_in1k_20211213-efed675c.pth + - Name: regnetx-6.4gf_8xb64_in1k + In Collection: RegNet + Config: configs/regnet/regnetx-6.4gf_8xb64_in1k.py + Metadata: + FLOPs: 6510000000 # 6.51G + Parameters: 26210000 # 26.21M + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 79.38 + Top 5 Accuracy: 94.65 + Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-6.4gf_8xb64_in1k_20211215-5c6089da.pth + - Name: regnetx-8.0gf_8xb64_in1k + In Collection: RegNet + Config: configs/regnet/regnetx-8.0gf_8xb64_in1k.py + Metadata: + FLOPs: 8030000000 # 8.03G + Parameters: 39570000 # 39.57M + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 79.12 + Top 5 Accuracy: 94.51 + Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-8.0gf_8xb64_in1k_20211213-9a9fcc76.pth + - Name: regnetx-12gf_8xb64_in1k + In Collection: RegNet + Config: configs/regnet/regnetx-12gf_8xb64_in1k.py + Metadata: + FLOPs: 12150000000 # 12.15G + Parameters: 46110000 # 46.11M + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 79.67 + Top 5 Accuracy: 95.03 + Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-12gf_8xb64_in1k_20211213-5df8c2f8.pth diff --git a/configs/regnet/regnetx-1.6gf_8xb128_in1k.py b/configs/regnet/regnetx-1.6gf_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d3e9e934fede12e5c06673dc12898db35654cf2a --- /dev/null +++ b/configs/regnet/regnetx-1.6gf_8xb128_in1k.py @@ -0,0 +1,6 @@ +_base_ = ['./regnetx-400mf_8xb128_in1k.py'] + +# model settings +model = dict( + backbone=dict(type='RegNet', arch='regnetx_1.6gf'), + head=dict(in_channels=912, )) diff --git a/configs/regnet/regnetx-12gf_8xb64_in1k.py b/configs/regnet/regnetx-12gf_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..7a2c0b5aa15ec760c461bf46d6ff9537c68f0fa4 --- /dev/null +++ b/configs/regnet/regnetx-12gf_8xb64_in1k.py @@ -0,0 +1,18 @@ +_base_ = ['./regnetx-400mf_8xb128_in1k.py'] + +# model settings +model = dict( + backbone=dict(type='RegNet', arch='regnetx_12gf'), + head=dict(in_channels=2240, )) + +# dataset settings +train_dataloader = dict(batch_size=64) + +# schedule settings +# for batch_size 512, use lr = 0.4 +optim_wrapper = dict(optimizer=dict(lr=0.4)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (8 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=512) diff --git a/configs/regnet/regnetx-3.2gf_8xb64_in1k.py b/configs/regnet/regnetx-3.2gf_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a78478d6df89eee57960f239069192a7d529682e --- /dev/null +++ b/configs/regnet/regnetx-3.2gf_8xb64_in1k.py @@ -0,0 +1,18 @@ +_base_ = ['./regnetx-400mf_8xb128_in1k.py'] + +# model settings +model = dict( + backbone=dict(type='RegNet', arch='regnetx_3.2gf'), + head=dict(in_channels=1008, )) + +# dataset settings +train_dataloader = dict(batch_size=64) + +# schedule settings +# for batch_size 512, use lr = 0.4 +optim_wrapper = dict(optimizer=dict(lr=0.4)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (8 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=512) diff --git a/configs/regnet/regnetx-4.0gf_8xb64_in1k.py b/configs/regnet/regnetx-4.0gf_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..dfc241fe0c8469ae3b8d522b7da7fb2da49f39de --- /dev/null +++ b/configs/regnet/regnetx-4.0gf_8xb64_in1k.py @@ -0,0 +1,18 @@ +_base_ = ['./regnetx-400mf_8xb128_in1k.py'] + +# model settings +model = dict( + backbone=dict(type='RegNet', arch='regnetx_4.0gf'), + head=dict(in_channels=1360, )) + +# dataset settings +train_dataloader = dict(batch_size=64) + +# schedule settings +# for batch_size 512, use lr = 0.4 +optim_wrapper = dict(optimizer=dict(lr=0.4)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (8 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=512) diff --git a/configs/regnet/regnetx-400mf_8xb128_in1k.py b/configs/regnet/regnetx-400mf_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..bad16785c04ad49db3b125fdcb343aa4c559cdd9 --- /dev/null +++ b/configs/regnet/regnetx-400mf_8xb128_in1k.py @@ -0,0 +1,58 @@ +_base_ = [ + '../_base_/models/regnet/regnetx_400mf.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs1024_coslr.py', + '../_base_/default_runtime.py' +] + +# dataset settings +data_preprocessor = dict( + # BGR format normalization parameters + mean=[103.53, 116.28, 123.675], + std=[57.375, 57.12, 58.395], + to_rgb=False, # The checkpoints from PyCls requires BGR format inputs. +) + +# lighting params, in order of BGR, from repo. pycls +EIGVAL = [0.2175, 0.0188, 0.0045] +EIGVEC = [ + [-0.5836, -0.6948, 0.4203], + [-0.5808, -0.0045, -0.814], + [-0.5675, 0.7192, 0.4009], +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='Lighting', + eigval=EIGVAL, + eigvec=EIGVEC, + alphastd=25.5, # because the value range of images is [0,255] + to_rgb=False), + dict(type='PackInputs'), +] + +train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(batch_size=128) +test_dataloader = dict(batch_size=128) + +# schedule settings + +# sgd with nesterov, base ls is 0.8 for batch_size 1024, +optim_wrapper = dict(optimizer=dict(lr=0.8, nesterov=True)) + +# runtime settings + +# Precise BN hook will update the bn stats, so this hook should be executed +# before CheckpointHook(priority of 'VERY_LOW') and +# EMAHook(priority of 'NORMAL') So set the priority of PreciseBNHook to +# 'ABOVENORMAL' here. +custom_hooks = [ + dict( + type='PreciseBNHook', + num_samples=8192, + interval=1, + priority='ABOVE_NORMAL') +] diff --git a/configs/regnet/regnetx-6.4gf_8xb64_in1k.py b/configs/regnet/regnetx-6.4gf_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..edb1fb8e482cd51f44c377c493f00c3e6d7185ad --- /dev/null +++ b/configs/regnet/regnetx-6.4gf_8xb64_in1k.py @@ -0,0 +1,18 @@ +_base_ = ['./regnetx-400mf_8xb128_in1k.py'] + +# model settings +model = dict( + backbone=dict(type='RegNet', arch='regnetx_6.4gf'), + head=dict(in_channels=1624, )) + +# dataset settings +train_dataloader = dict(batch_size=64) + +# schedule settings +# for batch_size 512, use lr = 0.4 +optim_wrapper = dict(optimizer=dict(lr=0.4)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (8 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=512) diff --git a/configs/regnet/regnetx-8.0gf_8xb64_in1k.py b/configs/regnet/regnetx-8.0gf_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..04b75bbe25987a6b10a984f264288e6c90b29719 --- /dev/null +++ b/configs/regnet/regnetx-8.0gf_8xb64_in1k.py @@ -0,0 +1,18 @@ +_base_ = ['./regnetx-400mf_8xb128_in1k.py'] + +# model settings +model = dict( + backbone=dict(type='RegNet', arch='regnetx_8.0gf'), + head=dict(in_channels=1920, )) + +# dataset settings +train_dataloader = dict(batch_size=64) + +# schedule settings +# for batch_size 512, use lr = 0.4 +optim_wrapper = dict(optimizer=dict(lr=0.4)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (8 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=512) diff --git a/configs/regnet/regnetx-800mf_8xb128_in1k.py b/configs/regnet/regnetx-800mf_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9cd71379a108703f5ca3ce7f4f156227085045aa --- /dev/null +++ b/configs/regnet/regnetx-800mf_8xb128_in1k.py @@ -0,0 +1,6 @@ +_base_ = ['./regnetx-400mf_8xb128_in1k.py'] + +# model settings +model = dict( + backbone=dict(type='RegNet', arch='regnetx_800mf'), + head=dict(in_channels=672, )) diff --git a/configs/replknet/README.md b/configs/replknet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3d312f24aa95837c056892cea315458749558206 --- /dev/null +++ b/configs/replknet/README.md @@ -0,0 +1,108 @@ +# RepLKNet + +> [Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs](https://arxiv.org/abs/2203.06717) + + + +## Abstract + +We revisit large kernel design in modern convolutional neural networks (CNNs). Inspired by recent advances in vision transformers (ViTs), in this paper, we demonstrate that using a few large convolutional kernels instead of a stack of small kernels could be a more powerful paradigm. We suggested five guidelines, e.g., applying re-parameterized large depth-wise convolutions, to design efficient highperformance large-kernel CNNs. Following the guidelines, we propose RepLKNet, a pure CNN architecture whose kernel size is as large as 31×31, in contrast to commonly used 3×3. RepLKNet greatly closes the performance gap between CNNs and ViTs, e.g., achieving comparable or superior results than Swin Transformer on ImageNet and a few typical downstream tasks, with lower latency. RepLKNet also shows nice scalability to big data and large models, obtaining 87.8% top-1 accuracy on ImageNet and 56.0% mIoU on ADE20K, which is very competitive among the state-of-the-arts with similar model sizes. Our study further reveals that, in contrast to small-kernel CNNs, large kernel CNNs have much larger effective receptive fields and higher shape bias rather than texture bias. + +
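A side note on the `auto_scale_lr = dict(base_batch_size=512)` field that appears in the RegNet configs above: when automatic LR scaling is enabled at launch, the optimizer learning rate is rescaled by the ratio of the actual total batch size to `base_batch_size` (the linear scaling rule). The sketch below is purely illustrative arithmetic, not the MMEngine implementation:

```python
# Minimal sketch of the linear LR scaling rule that `auto_scale_lr` relies on.
# Assumption: the learning rate scales linearly with the total batch size;
# this only illustrates the proportionality, it is not MMEngine's code.

def scale_lr(base_lr: float, base_batch_size: int,
             num_gpus: int, samples_per_gpu: int) -> float:
    """Return the learning rate rescaled for the actual total batch size."""
    actual_batch_size = num_gpus * samples_per_gpu
    return base_lr * actual_batch_size / base_batch_size

# The configs above tune lr=0.4 for base_batch_size = 8 x 64 = 512.
print(scale_lr(0.4, base_batch_size=512, num_gpus=8, samples_per_gpu=64))  # 0.4 (unchanged)
print(scale_lr(0.4, base_batch_size=512, num_gpus=4, samples_per_gpu=64))  # 0.2 (half the batch, half the LR)
```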
+ +
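The "re-parameterized large depth-wise convolutions" mentioned in the abstract train a small-kernel depth-wise branch in parallel with the large one and fold it in afterwards. Below is a minimal sketch of that folding under simplifying assumptions (plain convs with biases, no BatchNorm fusion); the backbone's `switch_to_deploy()` additionally folds the BN statistics:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical illustration: fold a parallel small-kernel depthwise conv into a
# large-kernel one by zero-padding its weights to the large kernel size and adding.
channels, large_k, small_k = 8, 31, 5
large = nn.Conv2d(channels, channels, large_k, padding=large_k // 2, groups=channels)
small = nn.Conv2d(channels, channels, small_k, padding=small_k // 2, groups=channels)

x = torch.randn(1, channels, 64, 64)
y_train = large(x) + small(x)  # training-time: two parallel branches, outputs summed

# Inference-time: a single large-kernel conv with merged weights and biases.
pad = (large_k - small_k) // 2
merged = nn.Conv2d(channels, channels, large_k, padding=large_k // 2, groups=channels)
merged.weight.data = large.weight.data + F.pad(small.weight.data, [pad] * 4)
merged.bias.data = large.bias.data + small.bias.data

print(torch.allclose(y_train, merged(x), atol=1e-5))  # True
```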
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model, get_model + +model = get_model('replknet-31B_3rdparty_in1k', pretrained=True) +model.backbone.switch_to_deploy() +predict = inference_model(model, 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('replknet-31B_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/replknet/replknet-31B_32xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k_20221118-fd08e268.pth +``` + +**Reparameterization** + +The checkpoints provided are all `training-time` models. Use the reparameterize tool to switch them to more efficient `inference-time` architecture, which not only has fewer parameters but also less calculations. + +```bash +python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH} +``` + +`${CFG_PATH}` is the config file, `${SRC_CKPT_PATH}` is the source chenpoint file, `${TARGET_CKPT_PATH}` is the target deploy weight file path. + +To use reparameterized weights, the config file must switch to the deploy config files. + +```bash +python tools/test.py ${deploy_cfg} ${deploy_checkpoint} --metrics accuracy +``` + +You can also use `backbone.switch_to_deploy()` to switch to the deploy mode in Python code. 
For example: + +```python +from mmpretrain.models import RepLKNet + +backbone = RepLKNet(arch='31B') +backbone.switch_to_deploy() +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :--------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :------------------------------------------------------------: | +| `replknet-31B_3rdparty_in1k`\* | From scratch | 79.86 | 15.64 | 83.48 | 96.57 | [config](replknet-31B_32xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k_20221118-fd08e268.pth) | +| `replknet-31B_3rdparty_in1k-384px`\* | From scratch | 79.86 | 45.95 | 84.84 | 97.34 | [config](replknet-31B_32xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k-384px_20221118-03a170ce.pth) | +| `replknet-31B_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 79.86 | 15.64 | 85.20 | 97.56 | [config](replknet-31B_32xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_in21k-pre_3rdparty_in1k_20221118-54ed5c46.pth) | +| `replknet-31B_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 79.86 | 45.95 | 85.99 | 97.75 | [config](replknet-31B_32xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_in21k-pre_3rdparty_in1k-384px_20221118-76c92b24.pth) | +| `replknet-31L_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 172.67 | 97.24 | 86.63 | 98.00 | [config](replknet-31L_32xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31L_in21k-pre_3rdparty_in1k-384px_20221118-dc3fc07c.pth) | +| `replknet-XL_meg73m-pre_3rdparty_in1k-320px`\* | MEG73M | 335.44 | 129.57 | 87.57 | 98.39 | [config](replknet-XL_32xb64_in1k-320px.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-XL_meg73m-pre_3rdparty_in1k-320px_20221118-88259b1d.pth) | + +*Models with * are converted from the [official repo](https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@inproceedings{ding2022scaling, + title={Scaling up your kernels to 31x31: Revisiting large kernel design in cnns}, + author={Ding, Xiaohan and Zhang, Xiangyu and Han, Jungong and Ding, Guiguang}, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + pages={11963--11975}, + year={2022} +} +``` diff --git a/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k-384px.py b/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..a14fe63efafbff3f249a2e4d5b2c96de931c6c1f --- /dev/null +++ b/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k-384px.py @@ -0,0 +1,3 @@ +_base_ = '../replknet-31B_32xb64_in1k-384px.py' + +model = dict(backbone=dict(small_kernel_merged=True)) diff --git a/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k.py b/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4f92c494f8afd0d494e199de20f26af7ce151aa1 --- /dev/null +++ b/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k.py @@ -0,0 +1,3 @@ +_base_ = '../replknet-31B_32xb64_in1k.py' + +model = dict(backbone=dict(small_kernel_merged=True)) diff --git a/configs/replknet/deploy/replknet-31L-deploy_32xb64_in1k-384px.py b/configs/replknet/deploy/replknet-31L-deploy_32xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..63e590f9786173d879b1f4390c91392f1df45bec --- /dev/null +++ b/configs/replknet/deploy/replknet-31L-deploy_32xb64_in1k-384px.py @@ -0,0 +1,3 @@ +_base_ = '../replknet-31L_32xb64_in1k-384px.py' + +model = dict(backbone=dict(small_kernel_merged=True)) diff --git a/configs/replknet/deploy/replknet-XL-deploy_32xb64_in1k-320px.py b/configs/replknet/deploy/replknet-XL-deploy_32xb64_in1k-320px.py new file mode 100644 index 0000000000000000000000000000000000000000..a0a8ed5f8f30aea7e53811ae63767187d5494bc6 --- /dev/null +++ b/configs/replknet/deploy/replknet-XL-deploy_32xb64_in1k-320px.py @@ -0,0 +1,3 @@ +_base_ = '../replknet-XL_32xb64_in1k-320px.py' + +model = dict(backbone=dict(small_kernel_merged=True)) diff --git a/configs/replknet/metafile.yml b/configs/replknet/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..f9f37449778415e0de57394adb457c8bc57c9e2b --- /dev/null +++ b/configs/replknet/metafile.yml @@ -0,0 +1,129 @@ +Collections: + - Name: RepLKNet + Metadata: + Training Data: ImageNet-1k + Architecture: + - Large-Kernel Convolution + - VGG-style Neural Network + Paper: + URL: https://arxiv.org/abs/2203.06717 + Title: 'Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs' + README: configs/replknet/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc3/mmcls/models/backbones/replknet.py + Version: v1.0.0rc3 + +Models: + - Name: replknet-31B_3rdparty_in1k + In Collection: RepLKNet + Config: configs/replknet/replknet-31B_32xb64_in1k.py + Metadata: + FLOPs: 15636547584 + Parameters: 79864168 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 83.48 + Top 5 Accuracy: 96.57 + Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k_20221118-fd08e268.pth + Converted From: + Weights: https://drive.google.com/u/0/uc?id=1azQUiCxK9feYVkkrPqwVPBtNsTzDrX7S&export=download + Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py + + 
- Name: replknet-31B_3rdparty_in1k-384px + In Collection: RepLKNet + Config: configs/replknet/replknet-31B_32xb64_in1k-384px.py + Metadata: + FLOPs: 45952303104 + Parameters: 79864168 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 84.84 + Top 5 Accuracy: 97.34 + Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k-384px_20221118-03a170ce.pth + Converted From: + Weights: https://drive.google.com/u/0/uc?id=1vo-P3XB6mRLUeDzmgv90dOu73uCeLfZN&export=download + Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py + + - Name: replknet-31B_in21k-pre_3rdparty_in1k + In Collection: RepLKNet + Config: configs/replknet/replknet-31B_32xb64_in1k.py + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 15636547584 + Parameters: 79864168 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 85.20 + Top 5 Accuracy: 97.56 + Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_in21k-pre_3rdparty_in1k_20221118-54ed5c46.pth + Converted From: + Weights: https://drive.google.com/u/0/uc?id=1DslZ2voXZQR1QoFY9KnbsHAeF84hzS0s&export=download + Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py + + - Name: replknet-31B_in21k-pre_3rdparty_in1k-384px + In Collection: RepLKNet + Config: configs/replknet/replknet-31B_32xb64_in1k-384px.py + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 45952303104 + Parameters: 79864168 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 85.99 + Top 5 Accuracy: 97.75 + Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_in21k-pre_3rdparty_in1k-384px_20221118-76c92b24.pth + Converted From: + Weights: https://drive.google.com/u/0/uc?id=1Sc46BWdXXm2fVP-K_hKKU_W8vAB-0duX&export=download + Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py + + - Name: replknet-31L_in21k-pre_3rdparty_in1k-384px + In Collection: RepLKNet + Config: configs/replknet/replknet-31L_32xb64_in1k-384px.py + Metadata: + Training Data: + - ImageNet-21k + - ImageNet-1k + FLOPs: 97240006656 + Parameters: 172671016 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 86.63 + Top 5 Accuracy: 98.00 + Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31L_in21k-pre_3rdparty_in1k-384px_20221118-dc3fc07c.pth + Converted From: + Weights: https://drive.google.com/u/0/uc?id=1JYXoNHuRvC33QV1pmpzMTKEni1hpWfBl&export=download + Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py + + - Name: replknet-XL_meg73m-pre_3rdparty_in1k-320px + In Collection: RepLKNet + Config: configs/replknet/replknet-XL_32xb64_in1k-320px.py + Metadata: + Training Data: + - MegData-73M + - ImageNet-1k + FLOPs: 129570201600 + Parameters: 335435752 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 87.57 + Top 5 Accuracy: 98.39 + Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-XL_meg73m-pre_3rdparty_in1k-320px_20221118-88259b1d.pth + Converted From: + Weights: https://drive.google.com/u/0/uc?id=1tPC60El34GntXByIRHb-z-Apm4Y5LX1T&export=download + Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py diff --git a/configs/replknet/replknet-31B_32xb64_in1k-384px.py b/configs/replknet/replknet-31B_32xb64_in1k-384px.py new file mode 100644 index 
0000000000000000000000000000000000000000..4e714f347a40101f2baf41a0723181a8502af85a --- /dev/null +++ b/configs/replknet/replknet-31B_32xb64_in1k-384px.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/replknet-31B_in1k.py', + '../_base_/datasets/imagenet_bs16_pil_bicubic_384.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +# schedule settings +param_scheduler = dict( + type='CosineAnnealingLR', T_max=300, by_epoch=True, begin=0, end=300) + +train_cfg = dict(by_epoch=True, max_epochs=300) diff --git a/configs/replknet/replknet-31B_32xb64_in1k.py b/configs/replknet/replknet-31B_32xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..cf06f2d86a39450574747d670f4bb9a7dfffaca6 --- /dev/null +++ b/configs/replknet/replknet-31B_32xb64_in1k.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/replknet-31B_in1k.py', + '../_base_/datasets/imagenet_bs32_pil_bicubic.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +# schedule settings +param_scheduler = dict( + type='CosineAnnealingLR', T_max=300, by_epoch=True, begin=0, end=300) + +train_cfg = dict(by_epoch=True, max_epochs=300) diff --git a/configs/replknet/replknet-31L_32xb64_in1k-384px.py b/configs/replknet/replknet-31L_32xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..8cdab249fefba7b7878211479b682768538c4b27 --- /dev/null +++ b/configs/replknet/replknet-31L_32xb64_in1k-384px.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/replknet-31L_in1k.py', + '../_base_/datasets/imagenet_bs16_pil_bicubic_384.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +# schedule settings +param_scheduler = dict( + type='CosineAnnealingLR', T_max=300, by_epoch=True, begin=0, end=300) + +train_cfg = dict(by_epoch=True, max_epochs=300) diff --git a/configs/replknet/replknet-XL_32xb64_in1k-320px.py b/configs/replknet/replknet-XL_32xb64_in1k-320px.py new file mode 100644 index 0000000000000000000000000000000000000000..9b0aab114e725e822dbffb99a637cc9e770a91e7 --- /dev/null +++ b/configs/replknet/replknet-XL_32xb64_in1k-320px.py @@ -0,0 +1,12 @@ +_base_ = [ + '../_base_/models/replknet-XL_in1k.py', + '../_base_/datasets/imagenet_bs8_pil_bicubic_320.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +# schedule settings +param_scheduler = dict( + type='CosineAnnealingLR', T_max=300, by_epoch=True, begin=0, end=300) + +train_cfg = dict(by_epoch=True, max_epochs=300) diff --git a/configs/repmlp/README.md b/configs/repmlp/README.md new file mode 100644 index 0000000000000000000000000000000000000000..41dfa234bd09153695a09af39b3901e536ca19b6 --- /dev/null +++ b/configs/repmlp/README.md @@ -0,0 +1,103 @@ +# RepMLP + +> [RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition](https://arxiv.org/abs/2105.01883) + + + +## Abstract + +We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers. Compared to convolutional layers, FC layers are more efficient, better at modeling the long-range dependencies and positional patterns, but worse at capturing the local structures, hence usually less favored for image recognition. We propose a structural re-parameterization technique that adds local prior into an FC to make it powerful for image recognition. 
Specifically, we construct convolutional layers inside a RepMLP during training and merge them into the FC for inference. On CIFAR, a simple pure-MLP model shows performance very close to CNN. By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs. Our intriguing findings highlight that combining the global representational capacity and positional perception of FC with the local prior of convolution can improve the performance of neural network with faster speed on both the tasks with translation invariance (e.g., semantic segmentation) and those with aligned images and positional patterns (e.g., face recognition). + +
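The merging described above works because a convolution is a linear operator on the flattened feature map, so its fully-connected equivalent can be materialized and added onto a parallel FC branch. A toy sketch of that identity-basis trick follows (single branch pair, no BatchNorm; illustrative only, not the `RepMLPNet` code):

```python
import torch
import torch.nn as nn

# Illustrative only: recover the FC-equivalent matrix of a small conv by pushing
# the identity basis through it, then absorb it into a parallel FC branch.
C, H, W = 2, 6, 6
N = C * H * W

conv = nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False)
fc = nn.Linear(N, N, bias=False)

with torch.no_grad():
    # Row i of conv(basis) is the conv applied to the i-th one-hot input.
    basis = torch.eye(N).reshape(N, C, H, W)
    conv_as_fc = conv(basis).reshape(N, N).t()

    x = torch.randn(1, C, H, W)
    y_train = fc(x.flatten(1)) + conv(x).flatten(1)  # training-time: FC branch + conv branch

    fc_merged = nn.Linear(N, N, bias=False)
    fc_merged.weight.copy_(fc.weight + conv_as_fc)   # inference-time: a single FC
    print(torch.allclose(y_train, fc_merged(x.flatten(1)), atol=1e-5))  # True
```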
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model, get_model + +model = get_model('repmlp-base_3rdparty_8xb64_in1k', pretrained=True) +model.backbone.switch_to_deploy() +predict = inference_model(model, 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('repmlp-base_3rdparty_8xb64_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/repmlp/repmlp-base_8xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k_20220330-1cb1f11b.pth +``` + +**Reparameterization** + +The checkpoints provided are all `training-time` models. Use the reparameterize tool to switch them to more efficient `inference-time` architecture, which not only has fewer parameters but also less calculations. + +```bash +python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH} +``` + +`${CFG_PATH}` is the config file, `${SRC_CKPT_PATH}` is the source chenpoint file, `${TARGET_CKPT_PATH}` is the target deploy weight file path. + +To use reparameterized weights, the config file must switch to the deploy config files. + +```bash +python tools/test.py ${deploy_cfg} ${deploy_checkpoint} --metrics accuracy +``` + +You can also use `backbone.switch_to_deploy()` to switch to the deploy mode in Python code. For example: + +```python +from mmpretrain.models import RepMLPNet + +backbone = RepMLPNet(arch='B', img_size=224, reparam_conv_kernels=(1, 3)) +backbone.switch_to_deploy() +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :-------------------------------------------------------------------: | +| `repmlp-base_3rdparty_8xb64_in1k`\* | From scratch | 68.24 | 6.71 | 80.41 | 95.14 | [config](repmlp-base_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k_20220330-1cb1f11b.pth) | +| `repmlp-base_3rdparty_8xb64_in1k-256px`\* | From scratch | 96.45 | 9.69 | 81.11 | 95.50 | [config](repmlp-base_8xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k-256px_20220330-7c5a91ce.pth) | + +*Models with * are converted from the [official repo](https://github.com/DingXiaoH/RepMLP/blob/072d8516beba83d75dfe6ebb12f625abad4b53d5/repmlpnet.py#L278). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{ding2021repmlp, + title={Repmlp: Re-parameterizing convolutions into fully-connected layers for image recognition}, + author={Ding, Xiaohan and Xia, Chunlong and Zhang, Xiangyu and Chu, Xiaojie and Han, Jungong and Ding, Guiguang}, + journal={arXiv preprint arXiv:2105.01883}, + year={2021} +} +``` diff --git a/configs/repmlp/metafile.yml b/configs/repmlp/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..7f391e04b7cfc2b3ffc93dbd2a781e6b201d1cde --- /dev/null +++ b/configs/repmlp/metafile.yml @@ -0,0 +1,48 @@ +Collections: + - Name: RepMLP + Metadata: + Training Data: ImageNet-1k + Architecture: + - Multi-layer Perceptron + - Re-parameterization Convolution + Paper: + URL: https://arxiv.org/abs/2105.01883 + Title: 'RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition' + README: configs/repmlp/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.21.0/mmcls/models/backbones/repmlp.py + Version: v0.21.0 + +Models: + - Name: repmlp-base_3rdparty_8xb64_in1k + In Collection: RepMLP + Config: configs/repmlp/repmlp-base_8xb64_in1k.py + Metadata: + FLOPs: 6710000000 # 6.71 G + Parameters: 68240000 # 68.24 M + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.41 + Top 5 Accuracy: 95.14 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k_20220330-1cb1f11b.pth + Converted From: + Weights: https://github.com/DingXiaoH/RepMLP + Code: https://github.com/DingXiaoH/RepMLP/blob/072d8516beba83d75dfe6ebb12f625abad4b53d5/repmlpnet.py#L274 + - Name: repmlp-base_3rdparty_8xb64_in1k-256px + In Collection: RepMLP + Config: configs/repmlp/repmlp-base_8xb64_in1k-256px.py + Metadata: + FLOPs: 9690000000 # 9.69 G + Parameters: 96450000 # 96.45M + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.11 + Top 5 Accuracy: 95.50 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k-256px_20220330-7c5a91ce.pth + Converted From: + Weights: https://github.com/DingXiaoH/RepMLP + Code: https://github.com/DingXiaoH/RepMLP/blob/072d8516beba83d75dfe6ebb12f625abad4b53d5/repmlpnet.py#L278 diff --git a/configs/repmlp/repmlp-base_8xb64_in1k-256px.py b/configs/repmlp/repmlp-base_8xb64_in1k-256px.py new file mode 100644 index 0000000000000000000000000000000000000000..81dc55a204918dec83b31c80cd37125a4ce3bb27 --- /dev/null +++ b/configs/repmlp/repmlp-base_8xb64_in1k-256px.py @@ -0,0 +1,36 @@ +_base_ = [ + '../_base_/models/repmlp-base_224.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict(backbone=dict(img_size=256)) + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=256), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=292, edge='short', backend='pillow'), + dict(type='CenterCrop', crop_size=256), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +optim_wrapper = 
dict(clip_grad=dict(max_norm=1.0)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (8 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=512) diff --git a/configs/repmlp/repmlp-base_8xb64_in1k.py b/configs/repmlp/repmlp-base_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..666ce405440d7c764a0959900cc3650f329cc019 --- /dev/null +++ b/configs/repmlp/repmlp-base_8xb64_in1k.py @@ -0,0 +1,26 @@ +_base_ = [ + '../_base_/models/repmlp-base_224.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# dataset settings +test_pipeline = [ + dict(type='LoadImageFromFile'), + # resizing to (256, 256) here, different from resizing shorter edge to 256 + dict(type='Resize', scale=(256, 256), backend='pillow'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (8 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=512) diff --git a/configs/repmlp/repmlp-base_delopy_8xb64_in1k.py b/configs/repmlp/repmlp-base_delopy_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b5b2c882341421f225b0b3ca0b57e2efd6c06e07 --- /dev/null +++ b/configs/repmlp/repmlp-base_delopy_8xb64_in1k.py @@ -0,0 +1,3 @@ +_base_ = ['./repmlp-base_8xb64_in1k.py'] + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/repmlp/repmlp-base_deploy_8xb64_in1k-256px.py b/configs/repmlp/repmlp-base_deploy_8xb64_in1k-256px.py new file mode 100644 index 0000000000000000000000000000000000000000..27ff50a02dc65c56162e7f851506f00dbb6bc8da --- /dev/null +++ b/configs/repmlp/repmlp-base_deploy_8xb64_in1k-256px.py @@ -0,0 +1,3 @@ +_base_ = ['./repmlp-base_8xb64_in1k-256px.py'] + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/repvgg/README.md b/configs/repvgg/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9a47f9d1e0a56a027072b661aef54225f1423205 --- /dev/null +++ b/configs/repvgg/README.md @@ -0,0 +1,142 @@ +# RepVGG + +> [RepVGG: Making VGG-style ConvNets Great Again](https://arxiv.org/abs/2101.03697) + + + +## Introduction + +RepVGG is a VGG-style convolutional architecture. It has the following advantages: + +1. The model has a VGG-like plain (a.k.a. feed-forward) topology 1 without any branches. I.e., every layer takes the output of its only preceding layer as input and feeds the output into its only following layer. +2. The model’s body uses only 3 × 3 conv and ReLU. +3. The concrete architecture (including the specific depth and layer widths) is instantiated with no automatic search, manual refinement, compound scaling, nor other heavy designs. + +
+ +
+ +## Abstract + +
+ +Show the paper's abstract + +
+We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology. Such decoupling of the training-time and inference-time architecture is realized by a structural re-parameterization technique so that the model is named RepVGG. On ImageNet, RepVGG reaches over 80% top-1 accuracy, which is the first time for a plain model, to the best of our knowledge. On NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster than ResNet-101 with higher accuracy and show favorable accuracy-speed trade-off compared to the state-of-the-art models like EfficientNet and RegNet. +
+ +
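For reference, the structural re-parameterization described above boils down to fusing each branch's BatchNorm into its convolution, expressing the 1x1 and identity branches as 3x3 kernels, and summing. A simplified sketch of what `switch_to_deploy()` does conceptually (not the mmpretrain implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(weight, bn):
    """Fold BatchNorm statistics into a bias-free conv weight; return weight, bias."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std
    return weight * scale.reshape(-1, 1, 1, 1), bn.bias - bn.running_mean * scale

C = 4
conv3, bn3 = nn.Conv2d(C, C, 3, padding=1, bias=False), nn.BatchNorm2d(C)
conv1, bn1 = nn.Conv2d(C, C, 1, bias=False), nn.BatchNorm2d(C)
bn_id = nn.BatchNorm2d(C)                       # the identity branch is just a BN

block = nn.ModuleList([conv3, bn3, conv1, bn1, bn_id])
block.train()
for _ in range(10):                             # populate BN running statistics
    x = torch.randn(8, C, 16, 16)
    bn3(conv3(x)); bn1(conv1(x)); bn_id(x)
block.eval()

x = torch.randn(1, C, 16, 16)
y_train = bn3(conv3(x)) + bn1(conv1(x)) + bn_id(x)   # training-time: three branches

with torch.no_grad():
    w3, b3 = fuse_conv_bn(conv3.weight, bn3)
    w1, b1 = fuse_conv_bn(F.pad(conv1.weight, [1, 1, 1, 1]), bn1)  # 1x1 -> 3x3
    w_id = torch.zeros(C, C, 3, 3)
    w_id[range(C), range(C), 1, 1] = 1.0                           # identity as a 3x3 kernel
    w_id, b_id = fuse_conv_bn(w_id, bn_id)
    y_deploy = F.conv2d(x, w3 + w1 + w_id, b3 + b1 + b_id, padding=1)

print(torch.allclose(y_train, y_deploy, atol=1e-5))  # True
```

This equivalence is why the deploy checkpoints produced by `tools/convert_models/reparameterize_model.py` give the same predictions with fewer parameters and less computation.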
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model, get_model + +model = get_model('repvgg-A0_8xb32_in1k', pretrained=True) +model.backbone.switch_to_deploy() +predict = inference_model(model, 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('repvgg-A0_8xb32_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/repvgg/repvgg-A0_8xb32_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/repvgg/repvgg-A0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A0_8xb32_in1k_20221213-60ae8e23.pth +``` + +Test with reparameterized model: + +```shell +python tools/test.py configs/repvgg/repvgg-A0_8xb32_in1k.py repvgg_A0_deploy.pth --cfg-options model.backbone.deploy=True +``` + +**Reparameterization** + +The checkpoints provided are all `training-time` models. Use the reparameterize tool to switch them to more efficient `inference-time` architecture, which not only has fewer parameters but also less calculations. + +```bash +python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH} +``` + +`${CFG_PATH}` is the config file, `${SRC_CKPT_PATH}` is the source chenpoint file, `${TARGET_CKPT_PATH}` is the target deploy weight file path. + +To use reparameterized weights, the config file must switch to the deploy config files. + +```bash +python tools/test.py ${deploy_cfg} ${deploy_checkpoint} --metrics accuracy +``` + +You can also use `backbone.switch_to_deploy()` to switch to the deploy mode in Python code. 
For example: + +```python +from mmpretrain.models import RepVGG + +backbone = RepVGG(arch='A0') +backbone.switch_to_deploy() +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------------------------------------: | +| `repvgg-A0_8xb32_in1k` | From scratch | 8.31 | 1.36 | 72.37 | 90.56 | [config](repvgg-A0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A0_8xb32_in1k_20221213-60ae8e23.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A0_8xb32_in1k_20221213-60ae8e23.log) | +| `repvgg-A1_8xb32_in1k` | From scratch | 12.79 | 2.36 | 74.23 | 91.80 | [config](repvgg-A1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A1_8xb32_in1k_20221213-f81bf3df.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A1_8xb32_in1k_20221213-f81bf3df.log) | +| `repvgg-A2_8xb32_in1k` | From scratch | 25.50 | 5.12 | 76.49 | 93.09 | [config](repvgg-A2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A2_8xb32_in1k_20221213-a8767caf.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A2_8xb32_in1k_20221213-a8767caf.log) | +| `repvgg-B0_8xb32_in1k` | From scratch | 3.42 | 15.82 | 75.27 | 92.21 | [config](repvgg-B0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B0_8xb32_in1k_20221213-5091ecc7.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B0_8xb32_in1k_20221213-5091ecc7.log) | +| `repvgg-B1_8xb32_in1k` | From scratch | 51.83 | 11.81 | 78.19 | 94.04 | [config](repvgg-B1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1_8xb32_in1k_20221213-d17c45e7.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1_8xb32_in1k_20221213-d17c45e7.log) | +| `repvgg-B1g2_8xb32_in1k` | From scratch | 41.36 | 8.81 | 77.87 | 93.99 | [config](repvgg-B1g2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g2_8xb32_in1k_20221213-ae6428fd.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g2_8xb32_in1k_20221213-ae6428fd.log) | +| `repvgg-B1g4_8xb32_in1k` | From scratch | 36.13 | 7.30 | 77.81 | 93.77 | [config](repvgg-B1g4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g4_8xb32_in1k_20221213-a7a4aaea.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g4_8xb32_in1k_20221213-a7a4aaea.log) | +| `repvgg-B2_8xb32_in1k` | From scratch | 80.32 | 18.37 | 78.58 | 94.23 | [config](repvgg-B2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2_8xb32_in1k_20221213-d8b420ef.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2_8xb32_in1k_20221213-d8b420ef.log) | +| `repvgg-B2g4_8xb32_in1k` | From scratch | 55.78 | 11.33 | 79.44 | 94.72 | [config](repvgg-B2g4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2g4_8xb32_in1k_20221213-0c1990eb.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2g4_8xb32_in1k_20221213-0c1990eb.log) | +| 
`repvgg-B3_8xb32_in1k` | From scratch | 110.96 | 26.21 | 80.58 | 95.33 | [config](repvgg-B3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3_8xb32_in1k_20221213-927a329a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3_8xb32_in1k_20221213-927a329a.log) | +| `repvgg-B3g4_8xb32_in1k` | From scratch | 75.63 | 16.06 | 80.26 | 95.15 | [config](repvgg-B3g4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3g4_8xb32_in1k_20221213-e01cb280.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3g4_8xb32_in1k_20221213-e01cb280.log) | +| `repvgg-D2se_3rdparty_in1k`\* | From scratch | 120.39 | 32.84 | 81.81 | 95.94 | [config](repvgg-D2se_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-D2se_3rdparty_4xb64-autoaug-lbs-mixup-coslr-200e_in1k_20210909-cf3139b7.pth) | + +*Models with * are converted from the [official repo](https://github.com/DingXiaoH/RepVGG/blob/9f272318abfc47a2b702cd0e916fca8d25d683e7/repvgg.py#L250). The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@inproceedings{ding2021repvgg, + title={Repvgg: Making vgg-style convnets great again}, + author={Ding, Xiaohan and Zhang, Xiangyu and Ma, Ningning and Han, Jungong and Ding, Guiguang and Sun, Jian}, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + pages={13733--13742}, + year={2021} +} +``` diff --git a/configs/repvgg/metafile.yml b/configs/repvgg/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..e93250ae2288b2ace58081bdcc24fc80c2f3c5b5 --- /dev/null +++ b/configs/repvgg/metafile.yml @@ -0,0 +1,175 @@ +Collections: + - Name: RepVGG + Metadata: + Training Data: ImageNet-1k + Architecture: + - re-parameterization Convolution + - VGG-style Neural Network + Paper: + URL: https://arxiv.org/abs/2101.03697 + Title: 'RepVGG: Making VGG-style ConvNets Great Again' + README: configs/repvgg/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.16.0/mmcls/models/backbones/repvgg.py#L257 + Version: v0.16.0 + +Models: + - Name: repvgg-A0_8xb32_in1k + In Collection: RepVGG + Config: configs/repvgg/repvgg-A0_8xb32_in1k.py + Metadata: + FLOPs: 1360233728 + Parameters: 8309384 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 72.37 + Top 5 Accuracy: 90.56 + Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A0_8xb32_in1k_20221213-60ae8e23.pth + - Name: repvgg-A1_8xb32_in1k + In Collection: RepVGG + Config: configs/repvgg/repvgg-A1_8xb32_in1k.py + Metadata: + FLOPs: 2362750208 + Parameters: 12789864 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 74.23 + Top 5 Accuracy: 91.80 + Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A1_8xb32_in1k_20221213-f81bf3df.pth + - Name: repvgg-A2_8xb32_in1k + In Collection: RepVGG + Config: configs/repvgg/repvgg-A2_8xb32_in1k.py + Metadata: + FLOPs: 5115612544 + Parameters: 25499944 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 76.49 + Top 5 Accuracy: 93.09 + Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A2_8xb32_in1k_20221213-a8767caf.pth + - Name: repvgg-B0_8xb32_in1k + In Collection: RepVGG + Config: configs/repvgg/repvgg-B0_8xb32_in1k.py + Metadata: 
+ FLOPs: 15820000000 + Parameters: 3420000 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 75.27 + Top 5 Accuracy: 92.21 + Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B0_8xb32_in1k_20221213-5091ecc7.pth + - Name: repvgg-B1_8xb32_in1k + In Collection: RepVGG + Config: configs/repvgg/repvgg-B1_8xb32_in1k.py + Metadata: + FLOPs: 11813537792 + Parameters: 51829480 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 78.19 + Top 5 Accuracy: 94.04 + Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1_8xb32_in1k_20221213-d17c45e7.pth + - Name: repvgg-B1g2_8xb32_in1k + In Collection: RepVGG + Config: configs/repvgg/repvgg-B1g2_8xb32_in1k.py + Metadata: + FLOPs: 8807794688 + Parameters: 41360104 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 77.87 + Top 5 Accuracy: 93.99 + Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g2_8xb32_in1k_20221213-ae6428fd.pth + - Name: repvgg-B1g4_8xb32_in1k + In Collection: RepVGG + Config: configs/repvgg/repvgg-B1g4_8xb32_in1k.py + Metadata: + FLOPs: 7304923136 + Parameters: 36125416 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 77.81 + Top 5 Accuracy: 93.77 + Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g4_8xb32_in1k_20221213-a7a4aaea.pth + - Name: repvgg-B2_8xb32_in1k + In Collection: RepVGG + Config: configs/repvgg/repvgg-B2_8xb32_in1k.py + Metadata: + FLOPs: 18374175232 + Parameters: 80315112 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 78.58 + Top 5 Accuracy: 94.23 + Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2_8xb32_in1k_20221213-d8b420ef.pth + - Name: repvgg-B2g4_8xb32_in1k + In Collection: RepVGG + Config: configs/repvgg/repvgg-B2g4_8xb32_in1k.py + Metadata: + FLOPs: 11329464832 + Parameters: 55777512 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 79.44 + Top 5 Accuracy: 94.72 + Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2g4_8xb32_in1k_20221213-0c1990eb.pth + - Name: repvgg-B3_8xb32_in1k + In Collection: RepVGG + Config: configs/repvgg/repvgg-B3_8xb32_in1k.py + Metadata: + FLOPs: 26206448128 + Parameters: 110960872 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 80.58 + Top 5 Accuracy: 95.33 + Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3_8xb32_in1k_20221213-927a329a.pth + - Name: repvgg-B3g4_8xb32_in1k + In Collection: RepVGG + Config: configs/repvgg/repvgg-B3g4_8xb32_in1k.py + Metadata: + FLOPs: 16062065152 + Parameters: 75626728 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 80.26 + Top 5 Accuracy: 95.15 + Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3g4_8xb32_in1k_20221213-e01cb280.pth + - Name: repvgg-D2se_3rdparty_in1k + In Collection: RepVGG + Config: configs/repvgg/repvgg-D2se_8xb32_in1k.py + Metadata: + FLOPs: 32838581760 + Parameters: 120387572 + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 81.81 + Top 5 Accuracy: 95.94 + Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-D2se_3rdparty_4xb64-autoaug-lbs-mixup-coslr-200e_in1k_20210909-cf3139b7.pth + Converted From: + Weights: 
https://drive.google.com/drive/folders/1Avome4KvNp0Lqh2QwhXO6L5URQjzCjUq + Code: https://github.com/DingXiaoH/RepVGG/blob/9f272318abfc47a2b702cd0e916fca8d25d683e7/repvgg.py#L250 diff --git a/configs/repvgg/repvgg-A0_8xb32_in1k.py b/configs/repvgg/repvgg-A0_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b767ae2a3e4062563cec782385baafdf6181baf3 --- /dev/null +++ b/configs/repvgg/repvgg-A0_8xb32_in1k.py @@ -0,0 +1,33 @@ +_base_ = [ + '../_base_/models/repvgg-A0_in1k.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +val_dataloader = dict(batch_size=256) +test_dataloader = dict(batch_size=256) + +# schedule settings +optim_wrapper = dict( + paramwise_cfg=dict( + bias_decay_mult=0.0, + custom_keys={ + 'branch_3x3.norm': dict(decay_mult=0.0), + 'branch_1x1.norm': dict(decay_mult=0.0), + 'branch_norm.bias': dict(decay_mult=0.0), + })) + +# schedule settings +param_scheduler = dict( + type='CosineAnnealingLR', + T_max=120, + by_epoch=True, + begin=0, + end=120, + convert_to_iter_based=True) + +train_cfg = dict(by_epoch=True, max_epochs=120) + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) diff --git a/configs/repvgg/repvgg-A0_deploy_in1k.py b/configs/repvgg/repvgg-A0_deploy_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..897e2bb36e9ad8197b4889f22530a32a79fef055 --- /dev/null +++ b/configs/repvgg/repvgg-A0_deploy_in1k.py @@ -0,0 +1,3 @@ +_base_ = './repvgg-A0_8xb32_in1k.py' + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/repvgg/repvgg-A1_8xb32_in1k.py b/configs/repvgg/repvgg-A1_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..fab5e586359370dd59a7ba55b91511541e922a11 --- /dev/null +++ b/configs/repvgg/repvgg-A1_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = './repvgg-A0_8xb32_in1k.py' + +model = dict(backbone=dict(arch='A1')) diff --git a/configs/repvgg/repvgg-A2_8xb32_in1k.py b/configs/repvgg/repvgg-A2_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f6196f02fbfedb36e9e498160884eeb7315513f6 --- /dev/null +++ b/configs/repvgg/repvgg-A2_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = './repvgg-A0_8xb32_in1k.py' + +model = dict(backbone=dict(arch='A2'), head=dict(in_channels=1408)) diff --git a/configs/repvgg/repvgg-B0_8xb32_in1k.py b/configs/repvgg/repvgg-B0_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9bbc4ab2259ccd929eae948cae0f676b7fca4b74 --- /dev/null +++ b/configs/repvgg/repvgg-B0_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = './repvgg-A0_8xb32_in1k.py' + +model = dict(backbone=dict(arch='B0'), head=dict(in_channels=1280)) diff --git a/configs/repvgg/repvgg-B1_8xb32_in1k.py b/configs/repvgg/repvgg-B1_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e08db3c4b8145cd3141851a7b41bbbe4fbfff776 --- /dev/null +++ b/configs/repvgg/repvgg-B1_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = './repvgg-A0_8xb32_in1k.py' + +model = dict(backbone=dict(arch='B1'), head=dict(in_channels=2048)) diff --git a/configs/repvgg/repvgg-B1g2_8xb32_in1k.py b/configs/repvgg/repvgg-B1g2_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a1c53fded4e0ff0c59038fb82ca8cb0ca3e41742 --- /dev/null +++ b/configs/repvgg/repvgg-B1g2_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = './repvgg-A0_8xb32_in1k.py' + +model = dict(backbone=dict(arch='B1g2'), head=dict(in_channels=2048)) 
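Most of the variant configs around here are only a few lines because `_base_` inheritance merges nested dict overrides into the parent config instead of replacing it wholesale. The snippet below mimics that merge behaviour with plain dicts; the base values are simplified stand-ins for `../_base_/models/repvgg-A0_in1k.py`, not its actual contents:

```python
# Hedged sketch of how `_base_` overrides behave: nested dicts are updated key by
# key (recursively), so a tiny override only states what changes. Illustrative
# only; MMEngine's Config implements the real merge.

def merge(base: dict, override: dict) -> dict:
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)   # recursive update keeps sibling keys
        else:
            out[key] = value
    return out

# Simplified stand-in for the RepVGG-A0 base model settings (hypothetical values).
base_model = dict(
    backbone=dict(type='RepVGG', arch='A0'),
    head=dict(type='LinearClsHead', num_classes=1000, in_channels=1280),
)
# What repvgg-B1_8xb32_in1k.py declares:
override = dict(backbone=dict(arch='B1'), head=dict(in_channels=2048))

print(merge(base_model, override))
# backbone keeps its type but arch becomes 'B1'; head.in_channels becomes 2048.
```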
diff --git a/configs/repvgg/repvgg-B1g4_8xb32_in1k.py b/configs/repvgg/repvgg-B1g4_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0757b1e580e5091b9d5c633cd87c856a526ebdf0 --- /dev/null +++ b/configs/repvgg/repvgg-B1g4_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = './repvgg-A0_8xb32_in1k.py' + +model = dict(backbone=dict(arch='B1g4'), head=dict(in_channels=2048)) diff --git a/configs/repvgg/repvgg-B2_8xb32_in1k.py b/configs/repvgg/repvgg-B2_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b9a7d4ca5570518f0c4d0b81951e0e97c46606f9 --- /dev/null +++ b/configs/repvgg/repvgg-B2_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = './repvgg-A0_8xb32_in1k.py' + +model = dict(backbone=dict(arch='B2'), head=dict(in_channels=2560)) diff --git a/configs/repvgg/repvgg-B2g4_8xb32_in1k.py b/configs/repvgg/repvgg-B2g4_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..8b3397881d74785870c266f1212cfee364dab38d --- /dev/null +++ b/configs/repvgg/repvgg-B2g4_8xb32_in1k.py @@ -0,0 +1,3 @@ +_base_ = './repvgg-B3_8xb32_in1k.py' + +model = dict(backbone=dict(arch='B2g4'), head=dict(in_channels=2560)) diff --git a/configs/repvgg/repvgg-B3_8xb32_in1k.py b/configs/repvgg/repvgg-B3_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e9d5257838c9e2061dfbe39aa2b1456820009ff3 --- /dev/null +++ b/configs/repvgg/repvgg-B3_8xb32_in1k.py @@ -0,0 +1,67 @@ +_base_ = [ + '../_base_/models/repvgg-B3_lbs-mixup_in1k.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict( + paramwise_cfg=dict( + bias_decay_mult=0.0, + custom_keys={ + 'branch_3x3.norm': dict(decay_mult=0.0), + 'branch_1x1.norm': dict(decay_mult=0.0), + 'branch_norm.bias': dict(decay_mult=0.0), + })) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=7, + magnitude_std=0.5, + hparams=dict(pad_val=[round(x) for x in bgr_mean])), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +param_scheduler = dict( + type='CosineAnnealingLR', + T_max=200, + by_epoch=True, + begin=0, + end=200, + convert_to_iter_based=True) + +train_cfg = dict(by_epoch=True, max_epochs=200) + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) diff --git a/configs/repvgg/repvgg-B3g4_8xb32_in1k.py b/configs/repvgg/repvgg-B3g4_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b0c5c00af845f5e4f02b44105095f78835f35096 --- /dev/null +++ b/configs/repvgg/repvgg-B3g4_8xb32_in1k.py @@ -0,0 +1,3 @@ 
+_base_ = './repvgg-B3_8xb32_in1k.py' + +model = dict(backbone=dict(arch='B3g4')) diff --git a/configs/repvgg/repvgg-D2se_8xb32_in1k.py b/configs/repvgg/repvgg-D2se_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f532dcd79686a119e1bed528a1e7c36195e70857 --- /dev/null +++ b/configs/repvgg/repvgg-D2se_8xb32_in1k.py @@ -0,0 +1,28 @@ +_base_ = './repvgg-B3_8xb32_in1k.py' + +model = dict(backbone=dict(arch='D2se'), head=dict(in_channels=2560)) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=295, + eta_min=1.0e-6, + by_epoch=True, + begin=5, + end=300) +] + +train_cfg = dict(by_epoch=True, max_epochs=300) + +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) diff --git a/configs/res2net/README.md b/configs/res2net/README.md new file mode 100644 index 0000000000000000000000000000000000000000..68b1acce79c18d994d2e310392a75a4b74db6078 --- /dev/null +++ b/configs/res2net/README.md @@ -0,0 +1,78 @@ +# Res2Net + +> [Res2Net: A New Multi-scale Backbone Architecture](https://arxiv.org/abs/1904.01169) + + + +## Abstract + +Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layer-wise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods. + +
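The "hierarchical residual-like connections within one single residual block" can be pictured as splitting the channels into `scales` groups and letting each 3x3 conv also see the previous group's output, so later splits cover progressively larger receptive fields. A stripped-down sketch of that split (no 1x1 convs, BN or residual; not the mmpretrain backbone code):

```python
import torch
import torch.nn as nn

# Hedged sketch of the multi-scale split inside a Res2Net bottleneck (scales=4).
class Res2NetSplit(nn.Module):
    def __init__(self, channels: int, scales: int = 4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        # one 3x3 conv per split except the first, which is passed through as-is
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1) for _ in range(scales - 1))

    def forward(self, x):
        splits = torch.chunk(x, self.scales, dim=1)
        outs = [splits[0]]                        # first split: identity
        prev = None
        for conv, s in zip(self.convs, splits[1:]):
            s = s if prev is None else s + prev   # hierarchical connection to previous output
            prev = conv(s)
            outs.append(prev)
        return torch.cat(outs, dim=1)             # later splits have seen larger receptive fields

x = torch.randn(1, 64, 32, 32)
print(Res2NetSplit(64, scales=4)(x).shape)        # torch.Size([1, 64, 32, 32])
```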
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('res2net50-w14-s8_3rdparty_8xb32_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('res2net50-w14-s8_3rdparty_8xb32_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/res2net/res2net50-w14-s8_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w14-s8_3rdparty_8xb32_in1k_20210927-bc967bf1.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :-------------------------------------------------------------------: | +| `res2net50-w14-s8_3rdparty_8xb32_in1k`\* | From scratch | 25.06 | 4.22 | 78.14 | 93.85 | [config](res2net50-w14-s8_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w14-s8_3rdparty_8xb32_in1k_20210927-bc967bf1.pth) | +| `res2net50-w26-s8_3rdparty_8xb32_in1k`\* | From scratch | 48.40 | 8.39 | 79.20 | 94.36 | [config](res2net50-w26-s8_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w26-s8_3rdparty_8xb32_in1k_20210927-f547a94b.pth) | +| `res2net101-w26-s4_3rdparty_8xb32_in1k`\* | From scratch | 45.21 | 8.12 | 79.19 | 94.44 | [config](res2net101-w26-s4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/res2net/res2net101-w26-s4_3rdparty_8xb32_in1k_20210927-870b6c36.pth) | + +*Models with * are converted from the [official repo](https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net.py#L181). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{gao2019res2net, + title={Res2Net: A New Multi-scale Backbone Architecture}, + author={Gao, Shang-Hua and Cheng, Ming-Ming and Zhao, Kai and Zhang, Xin-Yu and Yang, Ming-Hsuan and Torr, Philip}, + journal={IEEE TPAMI}, + year={2021}, + doi={10.1109/TPAMI.2019.2938758}, +} +``` diff --git a/configs/res2net/metafile.yml b/configs/res2net/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..b19b102f443998a335362a43b0deb57e0bc264a5 --- /dev/null +++ b/configs/res2net/metafile.yml @@ -0,0 +1,70 @@ +Collections: + - Name: Res2Net + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - SGD with Momentum + - Weight Decay + Architecture: + - Batch Normalization + - Convolution + - Global Average Pooling + - ReLU + - Res2Net Block + Paper: + Title: 'Res2Net: A New Multi-scale Backbone Architecture' + URL: https://arxiv.org/abs/1904.01169 + README: configs/res2net/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.17.0/mmcls/models/backbones/res2net.py + Version: v0.17.0 + +Models: + - Name: res2net50-w14-s8_3rdparty_8xb32_in1k + Metadata: + FLOPs: 4220000000 + Parameters: 25060000 + In Collection: Res2Net + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.14 + Top 5 Accuracy: 93.85 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w14-s8_3rdparty_8xb32_in1k_20210927-bc967bf1.pth + Converted From: + Weights: https://1drv.ms/u/s!AkxDDnOtroRPdOTqhF8ne_aakDI?e=EVb8Ri + Code: https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net.py#L221 + Config: configs/res2net/res2net50-w14-s8_8xb32_in1k.py + - Name: res2net50-w26-s8_3rdparty_8xb32_in1k + Metadata: + FLOPs: 8390000000 + Parameters: 48400000 + In Collection: Res2Net + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.20 + Top 5 Accuracy: 94.36 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w26-s8_3rdparty_8xb32_in1k_20210927-f547a94b.pth + Converted From: + Weights: https://1drv.ms/u/s!AkxDDnOtroRPdTrAd_Afzc26Z7Q?e=slYqsR + Code: https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net.py#L201 + Config: configs/res2net/res2net50-w26-s8_8xb32_in1k.py + - Name: res2net101-w26-s4_3rdparty_8xb32_in1k + Metadata: + FLOPs: 8120000000 + Parameters: 45210000 + In Collection: Res2Net + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.19 + Top 5 Accuracy: 94.44 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/res2net/res2net101-w26-s4_3rdparty_8xb32_in1k_20210927-870b6c36.pth + Converted From: + Weights: https://1drv.ms/u/s!AkxDDnOtroRPcJRgTLkahL0cFYw?e=nwbnic + Code: https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net.py#L181 + Config: configs/res2net/res2net101-w26-s4_8xb32_in1k.py diff --git a/configs/res2net/res2net101-w26-s4_8xb32_in1k.py b/configs/res2net/res2net101-w26-s4_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..7ebe9e94d64a305a06dda71c3c20d8c6c77cfc06 --- /dev/null +++ b/configs/res2net/res2net101-w26-s4_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/res2net101-w26-s4.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/res2net/res2net50-w14-s8_8xb32_in1k.py 
b/configs/res2net/res2net50-w14-s8_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..56cc02e3b893e4976940badabfa577db471620bc --- /dev/null +++ b/configs/res2net/res2net50-w14-s8_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/res2net50-w14-s8.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/res2net/res2net50-w26-s8_8xb32_in1k.py b/configs/res2net/res2net50-w26-s8_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d7dcbeb9164875b21aa782ac5bed5f4618a4363e --- /dev/null +++ b/configs/res2net/res2net50-w26-s8_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/res2net50-w26-s8.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnest/README.md b/configs/resnest/README.md new file mode 100644 index 0000000000000000000000000000000000000000..eb6c5fd728c3032b6b8c429100f1399b8803b765 --- /dev/null +++ b/configs/resnest/README.md @@ -0,0 +1,26 @@ +# ResNeSt + +> [ResNeSt: Split-Attention Networks](https://arxiv.org/abs/2004.08955) + + + +## Abstract + +It is well known that featuremap attention and multi-path representation are important for visual recognition. In this paper, we present a modularized architecture, which applies the channel-wise attention on different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations. Our design results in a simple and unified computation block, which can be parameterized using only a few variables. Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification. In addition, ResNeSt has achieved superior transfer learning results on several public benchmarks serving as the backbone, and has been adopted by the winning entries of COCO-LVIS challenge. The source code for complete system and pretrained models are publicly available. + +
+ +
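Unlike the other backbone READMEs in this diff, the ResNeSt page does not include a usage snippet. The ResNeSt configs added alongside it (e.g. `configs/resnest/resnest50_32xb64_in1k.py`) share their RandAugment policies through `_randaug_policies.py` and the `{{_base_.policies}}` base-variable reference. The following is only an illustrative sketch of how the composed config can be inspected; it assumes an mmpretrain working copy that contains these files and an installed `mmengine`, and is not part of the files added by this diff.

```python
# Inspect how the ResNeSt config resolves its `_base_` references.
# Run from the repository root of an mmpretrain checkout (assumption).
from mmengine.config import Config

cfg = Config.fromfile('configs/resnest/resnest50_32xb64_in1k.py')

# `{{_base_.policies}}` is replaced by the policy list defined in
# configs/resnest/_randaug_policies.py when the config is loaded.
print(cfg.train_pipeline[1]['type'])           # RandAugment
print(len(cfg.train_pipeline[1]['policies']))  # 16 shared policies
print(cfg.train_cfg.max_epochs)                # 270
print(cfg.auto_scale_lr.base_batch_size)       # 2048
```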
+ +## Citation + +``` +@misc{zhang2020resnest, + title={ResNeSt: Split-Attention Networks}, + author={Hang Zhang and Chongruo Wu and Zhongyue Zhang and Yi Zhu and Haibin Lin and Zhi Zhang and Yue Sun and Tong He and Jonas Mueller and R. Manmatha and Mu Li and Alexander Smola}, + year={2020}, + eprint={2004.08955}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/configs/resnest/_randaug_policies.py b/configs/resnest/_randaug_policies.py new file mode 100644 index 0000000000000000000000000000000000000000..d650caa2f586045ab76102a5506885e6da2fb4ed --- /dev/null +++ b/configs/resnest/_randaug_policies.py @@ -0,0 +1,92 @@ +policies = [ + dict(type='AutoContrast', prob=0.5), + dict(type='Equalize', prob=0.5), + dict(type='Invert', prob=0.5), + dict( + type='Rotate', + magnitude_key='angle', + magnitude_range=(0, 30), + pad_val=0, + prob=0.5, + random_negative_prob=0.5), + dict( + type='Posterize', + magnitude_key='bits', + magnitude_range=(0, 4), + prob=0.5), + dict( + type='Solarize', + magnitude_key='thr', + magnitude_range=(0, 256), + prob=0.5), + dict( + type='SolarizeAdd', + magnitude_key='magnitude', + magnitude_range=(0, 110), + thr=128, + prob=0.5), + dict( + type='ColorTransform', + magnitude_key='magnitude', + magnitude_range=(-0.9, 0.9), + prob=0.5, + random_negative_prob=0.), + dict( + type='Contrast', + magnitude_key='magnitude', + magnitude_range=(-0.9, 0.9), + prob=0.5, + random_negative_prob=0.), + dict( + type='Brightness', + magnitude_key='magnitude', + magnitude_range=(-0.9, 0.9), + prob=0.5, + random_negative_prob=0.), + dict( + type='Sharpness', + magnitude_key='magnitude', + magnitude_range=(-0.9, 0.9), + prob=0.5, + random_negative_prob=0.), + dict( + type='Shear', + magnitude_key='magnitude', + magnitude_range=(0, 0.3), + pad_val=0, + prob=0.5, + direction='horizontal', + random_negative_prob=0.5), + dict( + type='Shear', + magnitude_key='magnitude', + magnitude_range=(0, 0.3), + pad_val=0, + prob=0.5, + direction='vertical', + random_negative_prob=0.5), + dict( + type='Cutout', + magnitude_key='shape', + magnitude_range=(1, 41), + pad_val=0, + prob=0.5), + dict( + type='Translate', + magnitude_key='magnitude', + magnitude_range=(0, 0.3), + pad_val=0, + prob=0.5, + direction='horizontal', + random_negative_prob=0.5, + interpolation='bicubic'), + dict( + type='Translate', + magnitude_key='magnitude', + magnitude_range=(0, 0.3), + pad_val=0, + prob=0.5, + direction='vertical', + random_negative_prob=0.5, + interpolation='bicubic') +] diff --git a/configs/resnest/resnest101_32xb64_in1k.py b/configs/resnest/resnest101_32xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ac78659147a6fd1a56a89f56ed552ef3736488c4 --- /dev/null +++ b/configs/resnest/resnest101_32xb64_in1k.py @@ -0,0 +1,78 @@ +_base_ = [ + '../_base_/models/resnest101.py', + '../_base_/datasets/imagenet_bs64.py', + '../_base_/default_runtime.py', + './_randaug_policies.py', +] + +# dataset settings + +# lighting params, in order of BGR +EIGVAL = [55.4625, 4.7940, 1.1475] +EIGVEC = [ + [-0.5836, -0.6948, 0.4203], + [-0.5808, -0.0045, -0.8140], + [-0.5675, 0.7192, 0.4009], +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandAugment', + policies={{_base_.policies}}, + num_policies=2, + magnitude_level=12), + dict(type='EfficientNetRandomCrop', scale=256, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4), + dict( + type='Lighting', + 
eigval=EIGVAL, + eigvec=EIGVEC, + alphastd=0.1, + to_rgb=False), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=256, backend='pillow'), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=1e-4), + paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.), +) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-6, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=265, + by_epoch=True, + begin=5, + end=270, + ) +] + +train_cfg = dict(by_epoch=True, max_epochs=270) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/resnest/resnest200_64xb32_in1k.py b/configs/resnest/resnest200_64xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e3b9fb3d7dad8357829a820286f27ef0097426b6 --- /dev/null +++ b/configs/resnest/resnest200_64xb32_in1k.py @@ -0,0 +1,74 @@ +_base_ = [ + '../_base_/models/resnest200.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/default_runtime.py', + './_randaug_policies.py', +] + +# dataset settings + +# lighting params, in order of BGR +EIGVAL = [55.4625, 4.7940, 1.1475] +EIGVEC = [ + [-0.5836, -0.6948, 0.4203], + [-0.5808, -0.0045, -0.8140], + [-0.5675, 0.7192, 0.4009], +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandAugment', + policies={{_base_.policies}}, + num_policies=2, + magnitude_level=12), + dict(type='EfficientNetRandomCrop', scale=320, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4), + dict( + type='Lighting', + eigval=EIGVAL, + eigvec=EIGVEC, + alphastd=0.1, + to_rgb=False), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=320, backend='pillow'), + dict(type='PackInputs'), +] + +# schedule settings +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=1e-4), + paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.), +) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-6, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=265, + by_epoch=True, + begin=5, + end=270, + ) +] + +train_cfg = dict(by_epoch=True, max_epochs=270) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (64 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/resnest/resnest269_64xb32_in1k.py b/configs/resnest/resnest269_64xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0e884d63586f8210143ca0bf1e9cf33b2449a4f9 --- /dev/null +++ b/configs/resnest/resnest269_64xb32_in1k.py @@ -0,0 +1,78 @@ +_base_ = [ + '../_base_/models/resnest269.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/default_runtime.py', + './_randaug_policies.py', +] + +# dataset settings + +# lighting params, in order of BGR +EIGVAL = [55.4625, 4.7940, 1.1475] +EIGVEC = [ + [-0.5836, -0.6948, 0.4203], + [-0.5808, -0.0045, -0.8140], + [-0.5675, 0.7192, 0.4009], +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandAugment', + policies={{_base_.policies}}, + num_policies=2, + magnitude_level=12), + dict(type='EfficientNetRandomCrop', scale=416, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4), + dict( + type='Lighting', + eigval=EIGVAL, + eigvec=EIGVEC, + alphastd=0.1, + to_rgb=False), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=416, backend='pillow'), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=1e-4), + paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.), +) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-6, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=265, + by_epoch=True, + begin=5, + end=270, + ) +] + +train_cfg = dict(by_epoch=True, max_epochs=270) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (64 GPUs) x (32 samples per GPU) +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/resnest/resnest50_32xb64_in1k.py b/configs/resnest/resnest50_32xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..05f839b38b669a3093a8a7df7f78f135b88e6b77 --- /dev/null +++ b/configs/resnest/resnest50_32xb64_in1k.py @@ -0,0 +1,78 @@ +_base_ = [ + '../_base_/models/resnest50.py', + '../_base_/datasets/imagenet_bs64.py', + '../_base_/default_runtime.py', + './_randaug_policies.py', +] + +# dataset settings + +# lighting params, in order of BGR +EIGVAL = [55.4625, 4.7940, 1.1475] +EIGVEC = [ + [-0.5836, -0.6948, 0.4203], + [-0.5808, -0.0045, -0.8140], + [-0.5675, 0.7192, 0.4009], +] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandAugment', + policies={{_base_.policies}}, + num_policies=2, + magnitude_level=12), + dict(type='EfficientNetRandomCrop', scale=224, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4), + dict( + type='Lighting', + eigval=EIGVAL, + eigvec=EIGVEC, + alphastd=0.1, + to_rgb=False), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='EfficientNetCenterCrop', crop_size=256, backend='pillow'), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=1e-4), + paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.), +) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-6, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=265, + by_epoch=True, + begin=5, + end=270, + ) +] + +train_cfg = dict(by_epoch=True, max_epochs=270) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/resnet/README.md b/configs/resnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..286b77381a57401607cc52568d1d81b8ba5b4d83 --- /dev/null +++ b/configs/resnet/README.md @@ -0,0 +1,140 @@ +# ResNet + +> [Deep Residual Learning for Image Recognition](https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html) + + + +## Introduction + +**Residual Networks**, or **ResNets**, learn residual functions with reference to the layer inputs, instead of +learning unreferenced functions. In the mainstream previous works, like VGG, the neural networks are a stack +of layers and every layer attempts to fit a desired underlying mapping. In ResNets, a few stacked layers are +grouped as a block, and the layers in a block attempts to learn a residual mapping. + +Formally, denoting the desired underlying mapping of a block as $\mathcal{H}(x)$, split the underlying mapping +into the sum of the identity and the residual mapping as $\mathcal{H}(x) = x + \mathcal{F}(x)$, and let the +stacked non-linear layers fit the residual mapping $\mathcal{F}(x)$. 
+ +Many works have shown that this method makes deep neural networks easier to optimize and lets them gain accuracy from +considerably increased depth. Nowadays, the residual structure is widely used in various models. + +
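To make the formulation concrete, here is a minimal, self-contained sketch of a residual block in PyTorch. It only illustrates the $\mathcal{H}(x) = x + \mathcal{F}(x)$ decomposition described above; it is not the backbone implementation shipped in this repository, and the name `ToyResidualBlock` is made up for the example.

```python
import torch
from torch import nn


class ToyResidualBlock(nn.Module):
    """Illustrative block computing y = ReLU(x + F(x))."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x): the residual mapping fitted by the stacked layers.
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # H(x) = x + F(x): the identity shortcut carries x unchanged.
        return self.relu(x + self.residual(x))


if __name__ == '__main__':
    block = ToyResidualBlock(channels=64)
    out = block(torch.rand(1, 64, 56, 56))
    print(out.shape)  # torch.Size([1, 64, 56, 56])
```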
+ +
+ +## Abstract + +
+Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. + +The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. +
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('resnet18_8xb16_cifar10', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('resnet18_8xb16_cifar10', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/resnet/resnet18_8xb16_cifar10.py +``` + +Test: + +```shell +python tools/test.py configs/resnet/resnet18_8xb16_cifar10.py https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :--------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-------------------------------------------: | :----------------------------------------------------------------------: | +| `resnet18_8xb32_in1k` | From scratch | 11.69 | 1.82 | 69.90 | 89.43 | [config](resnet18_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_8xb32_in1k_20210831-fbbb1da6.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_8xb32_in1k_20210831-fbbb1da6.json) | +| `resnet34_8xb32_in1k` | From scratch | 2.18 | 3.68 | 73.62 | 91.59 | [config](resnet34_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_8xb32_in1k_20210831-f257d4e6.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_8xb32_in1k_20210831-f257d4e6.json) | +| `resnet50_8xb32_in1k` | From scratch | 25.56 | 4.12 | 76.55 | 93.06 | [config](resnet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.json) | +| `resnet101_8xb32_in1k` | From scratch | 44.55 | 7.85 | 77.97 | 94.06 | [config](resnet101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_8xb32_in1k_20210831-539c63f8.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_8xb32_in1k_20210831-539c63f8.json) | +| `resnet152_8xb32_in1k` | From scratch | 60.19 | 11.58 | 78.48 | 94.13 | [config](resnet152_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_8xb32_in1k_20210901-4d7582fa.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_8xb32_in1k_20210901-4d7582fa.json) | +| `resnetv1d50_8xb32_in1k` | From scratch | 25.58 | 4.36 | 77.54 | 93.57 | [config](resnetv1d50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.json) | +| `resnetv1d101_8xb32_in1k` | From scratch | 44.57 | 8.09 | 78.93 | 94.48 | [config](resnetv1d101_8xb32_in1k.py) | 
[model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.json) | +| `resnetv1d152_8xb32_in1k` | From scratch | 60.21 | 11.82 | 79.41 | 94.70 | [config](resnetv1d152_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.json) | +| `resnet50_8xb32-fp16_in1k` | From scratch | 25.56 | 4.12 | 76.30 | 93.07 | [config](resnet50_8xb32-fp16_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/fp16/resnet50_batch256_fp16_imagenet_20210320-b3964210.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/fp16/resnet50_batch256_fp16_imagenet_20210320-b3964210.json) | +| `resnet50_8xb256-rsb-a1-600e_in1k` | From scratch | 25.56 | 4.12 | 80.12 | 94.78 | [config](resnet50_8xb256-rsb-a1-600e_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a1-600e_in1k_20211228-20e21305.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a1-600e_in1k_20211228-20e21305.json) | +| `resnet50_8xb256-rsb-a2-300e_in1k` | From scratch | 25.56 | 4.12 | 79.55 | 94.37 | [config](resnet50_8xb256-rsb-a2-300e_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a2-300e_in1k_20211228-0fd8be6e.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a2-300e_in1k_20211228-0fd8be6e.json) | +| `resnet50_8xb256-rsb-a3-100e_in1k` | From scratch | 25.56 | 4.12 | 78.30 | 93.80 | [config](resnet50_8xb256-rsb-a3-100e_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a3-100e_in1k_20211228-3493673c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a3-100e_in1k_20211228-3493673c.json) | +| `resnetv1c50_8xb32_in1k` | From scratch | 25.58 | 4.36 | 77.01 | 93.58 | [config](resnetv1c50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c50_8xb32_in1k_20220214-3343eccd.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c50_8xb32_in1k_20220214-3343eccd.json) | +| `resnetv1c101_8xb32_in1k` | From scratch | 44.57 | 8.09 | 78.30 | 94.27 | [config](resnetv1c101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c101_8xb32_in1k_20220214-434fe45f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c101_8xb32_in1k_20220214-434fe45f.json) | +| `resnetv1c152_8xb32_in1k` | From scratch | 60.21 | 11.82 | 78.76 | 94.41 | [config](resnetv1c152_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c152_8xb32_in1k_20220214-c013291f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c152_8xb32_in1k_20220214-c013291f.json) | + +### Image Classification on CIFAR-10 + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :------------------------ | :----------: | :--------: | :-------: | :-------: | :----------------------------------: | :-------------------------------------------------------------------------------------------------: | +| `resnet18_8xb16_cifar10` | From scratch | 11.17 | 0.56 | 94.82 | 
[config](resnet18_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.json) | +| `resnet34_8xb16_cifar10` | From scratch | 21.28 | 1.16 | 95.34 | [config](resnet34_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_b16x8_cifar10_20210528-a8aa36a6.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_b16x8_cifar10_20210528-a8aa36a6.json) | +| `resnet50_8xb16_cifar10` | From scratch | 23.52 | 1.31 | 95.55 | [config](resnet50_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.json) | +| `resnet101_8xb16_cifar10` | From scratch | 42.51 | 2.52 | 95.58 | [config](resnet101_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_b16x8_cifar10_20210528-2d29e936.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_b16x8_cifar10_20210528-2d29e936.json) | +| `resnet152_8xb16_cifar10` | From scratch | 58.16 | 3.74 | 95.76 | [config](resnet152_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_b16x8_cifar10_20210528-3e8e9178.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_b16x8_cifar10_20210528-3e8e9178.json) | + +### Image Classification on CIFAR-100 + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: | +| `resnet50_8xb16_cifar100` | From scratch | 23.71 | 1.31 | 79.90 | 95.19 | [config](resnet50_8xb16_cifar100.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar100_20210528-67b58a1b.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar100_20210528-67b58a1b.json) | + +### Image Classification on CUB-200-2011 + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :------------------ | :----------: | :--------: | :-------: | :-------: | :----------------------------: | :-------------------------------------------------------------------------------------------------------------: | +| `resnet50_8xb8_cub` | From scratch | 23.92 | 16.48 | 88.45 | [config](resnet50_8xb8_cub.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb8_cub_20220307-57840e60.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb8_cub_20220307-57840e60.json) | + +## Citation + +```bibtex +@inproceedings{he2016deep, + title={Deep residual learning for image recognition}, + author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian}, + booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition}, + pages={770--778}, + year={2016} +} +``` diff --git a/configs/resnet/metafile.yml b/configs/resnet/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..16387248c43aea59c5563b4c6c98df8dd8effead --- /dev/null +++ b/configs/resnet/metafile.yml @@ -0,0 +1,352 @@ 
+Collections: + - Name: ResNet + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - SGD with Momentum + - Weight Decay + Training Resources: 8x V100 GPUs + Epochs: 100 + Batch Size: 256 + Architecture: + - ResNet + Paper: + URL: https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html + Title: "Deep Residual Learning for Image Recognition" + README: configs/resnet/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/resnet.py#L383 + Version: v0.15.0 + +Models: + - Name: resnet18_8xb16_cifar10 + Metadata: + Training Data: CIFAR-10 + Epochs: 200 + Batch Size: 128 + FLOPs: 560000000 + Parameters: 11170000 + In Collection: ResNet + Results: + - Dataset: CIFAR-10 + Metrics: + Top 1 Accuracy: 94.82 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth + Config: configs/resnet/resnet18_8xb16_cifar10.py + - Name: resnet34_8xb16_cifar10 + Metadata: + Training Data: CIFAR-10 + Epochs: 200 + Batch Size: 128 + FLOPs: 1160000000 + Parameters: 21280000 + In Collection: ResNet + Results: + - Dataset: CIFAR-10 + Metrics: + Top 1 Accuracy: 95.34 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_b16x8_cifar10_20210528-a8aa36a6.pth + Config: configs/resnet/resnet34_8xb16_cifar10.py + - Name: resnet50_8xb16_cifar10 + Metadata: + Training Data: CIFAR-10 + Epochs: 200 + Batch Size: 128 + FLOPs: 1310000000 + Parameters: 23520000 + In Collection: ResNet + Results: + - Dataset: CIFAR-10 + Metrics: + Top 1 Accuracy: 95.55 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth + Config: configs/resnet/resnet50_8xb16_cifar10.py + - Name: resnet101_8xb16_cifar10 + Metadata: + Training Data: CIFAR-10 + Epochs: 200 + Batch Size: 128 + FLOPs: 2520000000 + Parameters: 42510000 + In Collection: ResNet + Results: + - Dataset: CIFAR-10 + Metrics: + Top 1 Accuracy: 95.58 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_b16x8_cifar10_20210528-2d29e936.pth + Config: configs/resnet/resnet101_8xb16_cifar10.py + - Name: resnet152_8xb16_cifar10 + Metadata: + Training Data: CIFAR-10 + Epochs: 200 + Batch Size: 128 + FLOPs: 3740000000 + Parameters: 58160000 + In Collection: ResNet + Results: + - Dataset: CIFAR-10 + Metrics: + Top 1 Accuracy: 95.76 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_b16x8_cifar10_20210528-3e8e9178.pth + Config: configs/resnet/resnet152_8xb16_cifar10.py + - Name: resnet50_8xb16_cifar100 + Metadata: + Training Data: CIFAR-100 + Epochs: 200 + Batch Size: 128 + FLOPs: 1310000000 + Parameters: 23710000 + In Collection: ResNet + Results: + - Dataset: CIFAR-100 + Metrics: + Top 1 Accuracy: 79.90 + Top 5 Accuracy: 95.19 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar100_20210528-67b58a1b.pth + Config: configs/resnet/resnet50_8xb16_cifar100.py + - Name: resnet18_8xb32_in1k + Metadata: + FLOPs: 1820000000 + Parameters: 11690000 + In Collection: ResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 69.90 + Top 5 Accuracy: 89.43 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_8xb32_in1k_20210831-fbbb1da6.pth + Config: 
configs/resnet/resnet18_8xb32_in1k.py + - Name: resnet34_8xb32_in1k + Metadata: + FLOPs: 3680000000 + Parameters: 2180000 + In Collection: ResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 73.62 + Top 5 Accuracy: 91.59 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_8xb32_in1k_20210831-f257d4e6.pth + Config: configs/resnet/resnet34_8xb32_in1k.py + - Name: resnet50_8xb32_in1k + Metadata: + FLOPs: 4120000000 + Parameters: 25560000 + In Collection: ResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 76.55 + Top 5 Accuracy: 93.06 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth + Config: configs/resnet/resnet50_8xb32_in1k.py + - Name: resnet101_8xb32_in1k + Metadata: + FLOPs: 7850000000 + Parameters: 44550000 + In Collection: ResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.97 + Top 5 Accuracy: 94.06 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_8xb32_in1k_20210831-539c63f8.pth + Config: configs/resnet/resnet101_8xb32_in1k.py + - Name: resnet152_8xb32_in1k + Metadata: + FLOPs: 11580000000 + Parameters: 60190000 + In Collection: ResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.48 + Top 5 Accuracy: 94.13 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_8xb32_in1k_20210901-4d7582fa.pth + Config: configs/resnet/resnet152_8xb32_in1k.py + - Name: resnetv1d50_8xb32_in1k + Metadata: + FLOPs: 4360000000 + Parameters: 25580000 + In Collection: ResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.54 + Top 5 Accuracy: 93.57 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.pth + Config: configs/resnet/resnetv1d50_8xb32_in1k.py + - Name: resnetv1d101_8xb32_in1k + Metadata: + FLOPs: 8090000000 + Parameters: 44570000 + In Collection: ResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.93 + Top 5 Accuracy: 94.48 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.pth + Config: configs/resnet/resnetv1d101_8xb32_in1k.py + - Name: resnetv1d152_8xb32_in1k + Metadata: + FLOPs: 11820000000 + Parameters: 60210000 + In Collection: ResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.41 + Top 5 Accuracy: 94.70 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.pth + Config: configs/resnet/resnetv1d152_8xb32_in1k.py + - Name: resnet50_8xb32-fp16_in1k + Metadata: + FLOPs: 4120000000 + Parameters: 25560000 + Training Techniques: + - SGD with Momentum + - Weight Decay + - Mixed Precision Training + In Collection: ResNet + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 76.30 + Top 5 Accuracy: 93.07 + Weights: https://download.openmmlab.com/mmclassification/v0/fp16/resnet50_batch256_fp16_imagenet_20210320-b3964210.pth + Config: configs/resnet/resnet50_8xb32-fp16_in1k.py + - Name: resnet50_8xb256-rsb-a1-600e_in1k + Metadata: + FLOPs: 4120000000 + Parameters: 25560000 + Training Techniques: + - LAMB + - Weight Decay + - Cosine Annealing + - Mixup + - CutMix + - RepeatAugSampler + - 
RandAugment + Epochs: 600 + Batch Size: 2048 + In Collection: ResNet + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.12 + Top 5 Accuracy: 94.78 + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a1-600e_in1k_20211228-20e21305.pth + Config: configs/resnet/resnet50_8xb256-rsb-a1-600e_in1k.py + - Name: resnet50_8xb256-rsb-a2-300e_in1k + Metadata: + FLOPs: 4120000000 + Parameters: 25560000 + Training Techniques: + - LAMB + - Weight Decay + - Cosine Annealing + - Mixup + - CutMix + - RepeatAugSampler + - RandAugment + Epochs: 300 + Batch Size: 2048 + In Collection: ResNet + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.55 + Top 5 Accuracy: 94.37 + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a2-300e_in1k_20211228-0fd8be6e.pth + Config: configs/resnet/resnet50_8xb256-rsb-a2-300e_in1k.py + - Name: resnet50_8xb256-rsb-a3-100e_in1k + Metadata: + FLOPs: 4120000000 + Parameters: 25560000 + Training Techniques: + - LAMB + - Weight Decay + - Cosine Annealing + - Mixup + - CutMix + - RandAugment + Batch Size: 2048 + In Collection: ResNet + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.30 + Top 5 Accuracy: 93.80 + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a3-100e_in1k_20211228-3493673c.pth + Config: configs/resnet/resnet50_8xb256-rsb-a3-100e_in1k.py + - Name: resnetv1c50_8xb32_in1k + Metadata: + FLOPs: 4360000000 + Parameters: 25580000 + In Collection: ResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.01 + Top 5 Accuracy: 93.58 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c50_8xb32_in1k_20220214-3343eccd.pth + Config: configs/resnet/resnetv1c50_8xb32_in1k.py + - Name: resnetv1c101_8xb32_in1k + Metadata: + FLOPs: 8090000000 + Parameters: 44570000 + In Collection: ResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.30 + Top 5 Accuracy: 94.27 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c101_8xb32_in1k_20220214-434fe45f.pth + Config: configs/resnet/resnetv1c101_8xb32_in1k.py + - Name: resnetv1c152_8xb32_in1k + Metadata: + FLOPs: 11820000000 + Parameters: 60210000 + In Collection: ResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.76 + Top 5 Accuracy: 94.41 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c152_8xb32_in1k_20220214-c013291f.pth + Config: configs/resnet/resnetv1c152_8xb32_in1k.py + - Name: resnet50_8xb8_cub + Metadata: + FLOPs: 16480000000 + Parameters: 23920000 + In Collection: ResNet + Results: + - Dataset: CUB-200-2011 + Metrics: + Top 1 Accuracy: 88.45 + Task: Image Classification + Pretrain: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth + Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb8_cub_20220307-57840e60.pth + Config: configs/resnet/resnet50_8xb8_cub.py diff --git a/configs/resnet/resnet101_8xb16_cifar10.py b/configs/resnet/resnet101_8xb16_cifar10.py new file mode 100644 index 0000000000000000000000000000000000000000..166a1740b09c5fb74462a0672cd5fef54caae8f7 --- /dev/null +++ b/configs/resnet/resnet101_8xb16_cifar10.py @@ -0,0 +1,5 @@ +_base_ = [ + 
'../_base_/models/resnet101_cifar.py', + '../_base_/datasets/cifar10_bs16.py', + '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet101_8xb32_in1k.py b/configs/resnet/resnet101_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..388d2cd918ab75ec46346faa0448ef9cf2893fc8 --- /dev/null +++ b/configs/resnet/resnet101_8xb32_in1k.py @@ -0,0 +1,4 @@ +_base_ = [ + '../_base_/models/resnet101.py', '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet152_8xb16_cifar10.py b/configs/resnet/resnet152_8xb16_cifar10.py new file mode 100644 index 0000000000000000000000000000000000000000..3f307b6aa81661558b8308094de6e8327d08c830 --- /dev/null +++ b/configs/resnet/resnet152_8xb16_cifar10.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnet152_cifar.py', + '../_base_/datasets/cifar10_bs16.py', + '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet152_8xb32_in1k.py b/configs/resnet/resnet152_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..cc9dc2cee4a0fd8a9d47d461b2d5d00bf9962bf5 --- /dev/null +++ b/configs/resnet/resnet152_8xb32_in1k.py @@ -0,0 +1,4 @@ +_base_ = [ + '../_base_/models/resnet152.py', '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet18_8xb16_cifar10.py b/configs/resnet/resnet18_8xb16_cifar10.py new file mode 100644 index 0000000000000000000000000000000000000000..c7afa397b7b6a01decd0a010816ebe3678ca44aa --- /dev/null +++ b/configs/resnet/resnet18_8xb16_cifar10.py @@ -0,0 +1,4 @@ +_base_ = [ + '../_base_/models/resnet18_cifar.py', '../_base_/datasets/cifar10_bs16.py', + '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet18_8xb32_in1k.py b/configs/resnet/resnet18_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ac452ff75602464eba84a3eea150b30748122c69 --- /dev/null +++ b/configs/resnet/resnet18_8xb32_in1k.py @@ -0,0 +1,4 @@ +_base_ = [ + '../_base_/models/resnet18.py', '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet34_8xb16_cifar10.py b/configs/resnet/resnet34_8xb16_cifar10.py new file mode 100644 index 0000000000000000000000000000000000000000..7f5cd517d505ea479b506b6e4756c117c392dabd --- /dev/null +++ b/configs/resnet/resnet34_8xb16_cifar10.py @@ -0,0 +1,4 @@ +_base_ = [ + '../_base_/models/resnet34_cifar.py', '../_base_/datasets/cifar10_bs16.py', + '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet34_8xb32_in1k.py b/configs/resnet/resnet34_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..7749261c80defef7cbf94c4e1284c26382246dc6 --- /dev/null +++ b/configs/resnet/resnet34_8xb32_in1k.py @@ -0,0 +1,4 @@ +_base_ = [ + '../_base_/models/resnet34.py', '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet50_32xb64-warmup-coslr_in1k.py b/configs/resnet/resnet50_32xb64-warmup-coslr_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..c26245ef53a736c22c0ef7d4e9d8b7876509fe2e --- /dev/null +++ 
b/configs/resnet/resnet50_32xb64-warmup-coslr_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs64.py', + '../_base_/schedules/imagenet_bs2048_coslr.py', + '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet50_32xb64-warmup-lbs_in1k.py b/configs/resnet/resnet50_32xb64-warmup-lbs_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2f24f9a0f2c54a2bb634c1f374bc1b534d63697f --- /dev/null +++ b/configs/resnet/resnet50_32xb64-warmup-lbs_in1k.py @@ -0,0 +1,12 @@ +_base_ = ['./resnet50_32xb64-warmup_in1k.py'] +model = dict( + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=2048, + loss=dict( + type='LabelSmoothLoss', + loss_weight=1.0, + label_smooth_val=0.1, + num_classes=1000), + )) diff --git a/configs/resnet/resnet50_32xb64-warmup_in1k.py b/configs/resnet/resnet50_32xb64-warmup_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..34d5288b9d3f9fcf3f0b409dc1c17906654c2170 --- /dev/null +++ b/configs/resnet/resnet50_32xb64-warmup_in1k.py @@ -0,0 +1,4 @@ +_base_ = [ + '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs64.py', + '../_base_/schedules/imagenet_bs2048.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet50_8xb128_coslr-90e_in21k.py b/configs/resnet/resnet50_8xb128_coslr-90e_in21k.py new file mode 100644 index 0000000000000000000000000000000000000000..d2cc1ee2830661998505310d8c7074d8ae5da6b4 --- /dev/null +++ b/configs/resnet/resnet50_8xb128_coslr-90e_in21k.py @@ -0,0 +1,11 @@ +_base_ = [ + '../_base_/models/resnet50.py', '../_base_/datasets/imagenet21k_bs128.py', + '../_base_/schedules/imagenet_bs1024_coslr.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict(head=dict(num_classes=21843)) + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=90) diff --git a/configs/resnet/resnet50_8xb16-mixup_cifar10.py b/configs/resnet/resnet50_8xb16-mixup_cifar10.py new file mode 100644 index 0000000000000000000000000000000000000000..2420ebfeb0a34675a4b1b2a69c0b8a39e197ce35 --- /dev/null +++ b/configs/resnet/resnet50_8xb16-mixup_cifar10.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnet50_cifar_mixup.py', + '../_base_/datasets/cifar10_bs16.py', + '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet50_8xb16_cifar10.py b/configs/resnet/resnet50_8xb16_cifar10.py new file mode 100644 index 0000000000000000000000000000000000000000..669e5de27e526dd46d9f06c99e478dce16f0ac9a --- /dev/null +++ b/configs/resnet/resnet50_8xb16_cifar10.py @@ -0,0 +1,4 @@ +_base_ = [ + '../_base_/models/resnet50_cifar.py', '../_base_/datasets/cifar10_bs16.py', + '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet50_8xb16_cifar100.py b/configs/resnet/resnet50_8xb16_cifar100.py new file mode 100644 index 0000000000000000000000000000000000000000..ebde6c76ecca6d23b58edfb85ebc3b72ce15a2b2 --- /dev/null +++ b/configs/resnet/resnet50_8xb16_cifar100.py @@ -0,0 +1,19 @@ +_base_ = [ + '../_base_/models/resnet50_cifar.py', + '../_base_/datasets/cifar100_bs16.py', + '../_base_/schedules/cifar10_bs128.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict(head=dict(num_classes=100)) + +# schedule settings +optim_wrapper = dict(optimizer=dict(weight_decay=0.0005)) + +param_scheduler = dict( + type='MultiStepLR', + by_epoch=True, + milestones=[60, 120, 160], + gamma=0.2, +) diff --git 
a/configs/resnet/resnet50_8xb256-rsb-a1-600e_in1k.py b/configs/resnet/resnet50_8xb256-rsb-a1-600e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a4ea15984a0063c06e09eb5063d49b2cf90371cf --- /dev/null +++ b/configs/resnet/resnet50_8xb256-rsb-a1-600e_in1k.py @@ -0,0 +1,56 @@ +_base_ = [ + '../_base_/models/resnet50.py', + '../_base_/datasets/imagenet_bs256_rsb_a12.py', + '../_base_/schedules/imagenet_bs2048_rsb.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict( + backbone=dict( + norm_cfg=dict(type='SyncBN', requires_grad=True), + drop_path_rate=0.05, + ), + head=dict( + loss=dict( + type='LabelSmoothLoss', + label_smooth_val=0.1, + mode='original', + use_sigmoid=True, + )), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.2), + dict(type='CutMix', alpha=1.0) + ]), +) + +# dataset settings +train_dataloader = dict(sampler=dict(type='RepeatAugSampler', shuffle=True)) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.), +) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=595, + eta_min=1.0e-6, + by_epoch=True, + begin=5, + end=600) +] + +train_cfg = dict(by_epoch=True, max_epochs=600) diff --git a/configs/resnet/resnet50_8xb256-rsb-a2-300e_in1k.py b/configs/resnet/resnet50_8xb256-rsb-a2-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..df8edc0370400a3f3985c33bffae2d04afc55772 --- /dev/null +++ b/configs/resnet/resnet50_8xb256-rsb-a2-300e_in1k.py @@ -0,0 +1,46 @@ +_base_ = [ + '../_base_/models/resnet50.py', + '../_base_/datasets/imagenet_bs256_rsb_a12.py', + '../_base_/schedules/imagenet_bs2048_rsb.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict( + backbone=dict( + norm_cfg=dict(type='SyncBN', requires_grad=True), + drop_path_rate=0.05, + ), + head=dict(loss=dict(use_sigmoid=True)), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.1), + dict(type='CutMix', alpha=1.0) + ])) + +# dataset settings +train_dataloader = dict(sampler=dict(type='RepeatAugSampler', shuffle=True)) + +# schedule settings +optim_wrapper = dict( + paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.)) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=295, + eta_min=1.0e-6, + by_epoch=True, + begin=5, + end=300) +] +train_cfg = dict(by_epoch=True, max_epochs=300) diff --git a/configs/resnet/resnet50_8xb256-rsb-a3-100e_in1k.py b/configs/resnet/resnet50_8xb256-rsb-a3-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..3a36c5843a69aea20fdb9287561e5c2a96459852 --- /dev/null +++ b/configs/resnet/resnet50_8xb256-rsb-a3-100e_in1k.py @@ -0,0 +1,22 @@ +_base_ = [ + '../_base_/models/resnet50.py', + '../_base_/datasets/imagenet_bs256_rsb_a3.py', + '../_base_/schedules/imagenet_bs2048_rsb.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict( + backbone=dict(norm_cfg=dict(type='SyncBN', requires_grad=True)), + head=dict(loss=dict(use_sigmoid=True)), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.1), + 
dict(type='CutMix', alpha=1.0) + ]), +) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=0.008), + paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.), +) diff --git a/configs/resnet/resnet50_8xb32-coslr-preciseBN_in1k.py b/configs/resnet/resnet50_8xb32-coslr-preciseBN_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..01fefbbf2852eeceddb0ad026fb5098e763e0710 --- /dev/null +++ b/configs/resnet/resnet50_8xb32-coslr-preciseBN_in1k.py @@ -0,0 +1,13 @@ +_base_ = 'resnet50_8xb32-coslr_in1k.py' + +# Precise BN hook will update the bn stats, so this hook should be executed +# before CheckpointHook(priority of 'VERY_LOW') and +# EMAHook(priority of 'NORMAL') So set the priority of PreciseBNHook to +# 'ABOVENORMAL' here. +custom_hooks = [ + dict( + type='PreciseBNHook', + num_samples=8192, + interval=1, + priority='ABOVE_NORMAL') +] diff --git a/configs/resnet/resnet50_8xb32-coslr_in1k.py b/configs/resnet/resnet50_8xb32-coslr_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..938a114b79696b5ad3442c1dd2a7aea33342b679 --- /dev/null +++ b/configs/resnet/resnet50_8xb32-coslr_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256_coslr.py', + '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet50_8xb32-cutmix_in1k.py b/configs/resnet/resnet50_8xb32-cutmix_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2f8d0ca9f3a500344c18b669f25f3cb78393d7dd --- /dev/null +++ b/configs/resnet/resnet50_8xb32-cutmix_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnet50_cutmix.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet50_8xb32-fp16-dynamic_in1k.py b/configs/resnet/resnet50_8xb32-fp16-dynamic_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..58f6fe4cf25e8f0b3d321a7aab4b746552aa4163 --- /dev/null +++ b/configs/resnet/resnet50_8xb32-fp16-dynamic_in1k.py @@ -0,0 +1,4 @@ +_base_ = ['./resnet50_8xb32_in1k.py'] + +# schedule settings +optim_wrapper = dict(type='AmpOptimWrapper', loss_scale='dynamic') diff --git a/configs/resnet/resnet50_8xb32-fp16_in1k.py b/configs/resnet/resnet50_8xb32-fp16_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..19ee6ee4f82ec02f34628bdf8dd74a379798cc67 --- /dev/null +++ b/configs/resnet/resnet50_8xb32-fp16_in1k.py @@ -0,0 +1,4 @@ +_base_ = ['./resnet50_8xb32_in1k.py'] + +# schedule settings +optim_wrapper = dict(type='AmpOptimWrapper', loss_scale=512.) 
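The two fp16 configs above differ only in how the `AmpOptimWrapper` scales the loss: a fixed `loss_scale=512.` versus `loss_scale='dynamic'`. Loss scaling exists because fp16 gradients can underflow; the sketch below shows only the bookkeeping a fixed scale implies (multiply the loss before backward, divide the gradients before the step). It is an illustration under that assumption, not MMEngine's `AmpOptimWrapper` implementation, and it stays in fp32 so it runs on any machine.

```python
import torch
from torch import nn

# Toy model and data; in real mixed-precision training the forward and
# backward passes would run (partly) in fp16.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(4, 8), torch.randn(4, 1)

LOSS_SCALE = 512.0  # corresponds to `loss_scale=512.` in the config

loss = nn.functional.mse_loss(model(x), y)
(loss * LOSS_SCALE).backward()       # scaled loss -> larger, underflow-safe grads
for p in model.parameters():
    if p.grad is not None:
        p.grad.div_(LOSS_SCALE)      # unscale gradients before the optimizer step
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```

With `loss_scale='dynamic'`, the scale is typically adjusted on the fly instead: it grows while training is stable and backs off (skipping the step) when scaled gradients overflow.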
diff --git a/configs/resnet/resnet50_8xb32-lbs_in1k.py b/configs/resnet/resnet50_8xb32-lbs_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..1c1aa5a2c4eee10c10159175224d9b77ea57e57b --- /dev/null +++ b/configs/resnet/resnet50_8xb32-lbs_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnet50_label_smooth.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet50_8xb32-mixup_in1k.py b/configs/resnet/resnet50_8xb32-mixup_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2a153d0e18f521f72b8beaf4cbea36d41f5b3300 --- /dev/null +++ b/configs/resnet/resnet50_8xb32-mixup_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnet50_mixup.py', + '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet50_8xb32_in1k.py b/configs/resnet/resnet50_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..c32f333b67c255c6101469323636bf242eebb8da --- /dev/null +++ b/configs/resnet/resnet50_8xb32_in1k.py @@ -0,0 +1,4 @@ +_base_ = [ + '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs32.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnet50_8xb8_cub.py b/configs/resnet/resnet50_8xb8_cub.py new file mode 100644 index 0000000000000000000000000000000000000000..17054ef536930d74136897f8f25637321a364ce7 --- /dev/null +++ b/configs/resnet/resnet50_8xb8_cub.py @@ -0,0 +1,20 @@ +_base_ = [ + '../_base_/models/resnet50.py', + '../_base_/datasets/cub_bs8_448.py', + '../_base_/schedules/cub_bs64.py', + '../_base_/default_runtime.py', +] + +# model settings +# use pre-train weight converted from https://github.com/Alibaba-MIIL/ImageNet21K # noqa +pretrained = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth' # noqa + +model = dict( + type='ImageClassifier', + backbone=dict( + init_cfg=dict( + type='Pretrained', checkpoint=pretrained, prefix='backbone')), + head=dict(num_classes=200, )) + +# runtime settings +default_hooks = dict(logger=dict(type='LoggerHook', interval=20)) diff --git a/configs/resnet/resnetv1c101_8xb32_in1k.py b/configs/resnet/resnetv1c101_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..441aff591851f402a176c142c93dc866a77b82c2 --- /dev/null +++ b/configs/resnet/resnetv1c101_8xb32_in1k.py @@ -0,0 +1,7 @@ +_base_ = [ + '../_base_/models/resnetv1c50.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] + +model = dict(backbone=dict(depth=101)) diff --git a/configs/resnet/resnetv1c152_8xb32_in1k.py b/configs/resnet/resnetv1c152_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b9f466f85c8e8c89fb78f53c27eca1d5acaf5221 --- /dev/null +++ b/configs/resnet/resnetv1c152_8xb32_in1k.py @@ -0,0 +1,7 @@ +_base_ = [ + '../_base_/models/resnetv1c50.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] + +model = dict(backbone=dict(depth=152)) diff --git a/configs/resnet/resnetv1c50_8xb32_in1k.py b/configs/resnet/resnetv1c50_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..aa1c8b6475ce373f4a35123a72e31419b87027c0 --- /dev/null +++ 
b/configs/resnet/resnetv1c50_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnetv1c50.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnetv1d101_8xb32_in1k.py b/configs/resnet/resnetv1d101_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b16ca863db2c50267764b1b37aa8b2db891ad2c9 --- /dev/null +++ b/configs/resnet/resnetv1d101_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnetv1d101.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnetv1d152_8xb32_in1k.py b/configs/resnet/resnetv1d152_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..76926ddbb661029b8cff86ad0d98028531235fa1 --- /dev/null +++ b/configs/resnet/resnetv1d152_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnetv1d152.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnet/resnetv1d50_8xb32_in1k.py b/configs/resnet/resnetv1d50_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..208bde470ad12407d7e56eddeddfc88529e3708b --- /dev/null +++ b/configs/resnet/resnetv1d50_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnetv1d50.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnext/README.md b/configs/resnext/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b901b31bd5bd3b99bce07cc2454e4b9a12d40bb2 --- /dev/null +++ b/configs/resnext/README.md @@ -0,0 +1,83 @@ +# ResNeXt + +> [Aggregated Residual Transformations for Deep Neural Networks](https://openaccess.thecvf.com/content_cvpr_2017/html/Xie_Aggregated_Residual_Transformations_CVPR_2017_paper.html) + + + +## Abstract + +We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call "cardinality" (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online. + +
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('resnext50-32x4d_8xb32_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('resnext50-32x4d_8xb32_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/resnext/resnext50-32x4d_8xb32_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/resnext/resnext50-32x4d_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/resnext/resnext50_32x4d_b32x8_imagenet_20210429-56066e27.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :--------------------------------------------------------------------------------: | +| `resnext50-32x4d_8xb32_in1k` | From scratch | 25.03 | 4.27 | 77.90 | 93.66 | [config](resnext50-32x4d_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnext/resnext50_32x4d_b32x8_imagenet_20210429-56066e27.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnext/resnext50_32x4d_b32x8_imagenet_20210429-56066e27.json) | +| `resnext101-32x4d_8xb32_in1k` | From scratch | 44.18 | 8.03 | 78.61 | 94.17 | [config](resnext101-32x4d_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x4d_b32x8_imagenet_20210506-e0fa3dd5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x4d_b32x8_imagenet_20210506-e0fa3dd5.json) | +| `resnext101-32x8d_8xb32_in1k` | From scratch | 88.79 | 16.50 | 79.27 | 94.58 | [config](resnext101-32x8d_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x8d_b32x8_imagenet_20210506-23a247d5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x8d_b32x8_imagenet_20210506-23a247d5.json) | +| `resnext152-32x4d_8xb32_in1k` | From scratch | 59.95 | 11.80 | 78.88 | 94.33 | [config](resnext152-32x4d_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnext/resnext152_32x4d_b32x8_imagenet_20210524-927787be.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnext/resnext152_32x4d_b32x8_imagenet_20210524-927787be.json) | + +## Citation + +```bibtex +@inproceedings{xie2017aggregated, + title={Aggregated residual transformations for deep neural networks}, + author={Xie, Saining and Girshick, Ross and Doll{\'a}r, Piotr and Tu, Zhuowen and He, Kaiming}, + booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition}, + pages={1492--1500}, + year={2017} +} +``` diff --git a/configs/resnext/metafile.yml b/configs/resnext/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..71283288fd743116c00b14ee1dc1697770b0706c --- /dev/null +++ b/configs/resnext/metafile.yml @@ -0,0 +1,73 @@ +Collections: + - Name: ResNeXt + 
Metadata: + Training Data: ImageNet-1k + Training Techniques: + - SGD with Momentum + - Weight Decay + Training Resources: 8x V100 GPUs + Epochs: 100 + Batch Size: 256 + Architecture: + - ResNeXt + Paper: + URL: https://openaccess.thecvf.com/content_cvpr_2017/html/Xie_Aggregated_Residual_Transformations_CVPR_2017_paper.html + Title: "Aggregated Residual Transformations for Deep Neural Networks" + README: configs/resnext/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/resnext.py#L90 + Version: v0.15.0 + +Models: + - Name: resnext50-32x4d_8xb32_in1k + Metadata: + FLOPs: 4270000000 + Parameters: 25030000 + In Collection: ResNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.90 + Top 5 Accuracy: 93.66 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnext/resnext50_32x4d_b32x8_imagenet_20210429-56066e27.pth + Config: configs/resnext/resnext50-32x4d_8xb32_in1k.py + - Name: resnext101-32x4d_8xb32_in1k + Metadata: + FLOPs: 8030000000 + Parameters: 44180000 + In Collection: ResNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.61 + Top 5 Accuracy: 94.17 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x4d_b32x8_imagenet_20210506-e0fa3dd5.pth + Config: configs/resnext/resnext101-32x4d_8xb32_in1k.py + - Name: resnext101-32x8d_8xb32_in1k + Metadata: + FLOPs: 16500000000 + Parameters: 88790000 + In Collection: ResNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.27 + Top 5 Accuracy: 94.58 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x8d_b32x8_imagenet_20210506-23a247d5.pth + Config: configs/resnext/resnext101-32x8d_8xb32_in1k.py + - Name: resnext152-32x4d_8xb32_in1k + Metadata: + FLOPs: 11800000000 + Parameters: 59950000 + In Collection: ResNeXt + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.88 + Top 5 Accuracy: 94.33 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/resnext/resnext152_32x4d_b32x8_imagenet_20210524-927787be.pth + Config: configs/resnext/resnext152-32x4d_8xb32_in1k.py diff --git a/configs/resnext/resnext101-32x4d_8xb32_in1k.py b/configs/resnext/resnext101-32x4d_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..970aa60f35fb6b04f72688d5862155575858b1fe --- /dev/null +++ b/configs/resnext/resnext101-32x4d_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnext101_32x4d.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnext/resnext101-32x8d_8xb32_in1k.py b/configs/resnext/resnext101-32x8d_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..315d05fd57b34d80ab1590077f98d21b80453209 --- /dev/null +++ b/configs/resnext/resnext101-32x8d_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnext101_32x8d.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnext/resnext152-32x4d_8xb32_in1k.py b/configs/resnext/resnext152-32x4d_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9c137313cb7f357f8328048ffe833cdc4952cb84 --- /dev/null +++ b/configs/resnext/resnext152-32x4d_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + 
'../_base_/models/resnext152_32x4d.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/resnext/resnext50-32x4d_8xb32_in1k.py b/configs/resnext/resnext50-32x4d_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..bd9c9fcf4e6d9941cb87ffc963cc99b39069116c --- /dev/null +++ b/configs/resnext/resnext50-32x4d_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/resnext50_32x4d.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/revvit/README.md b/configs/revvit/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0439b22ac9d196a56016503f210fc73d3baab71d --- /dev/null +++ b/configs/revvit/README.md @@ -0,0 +1,91 @@ +# Reversible Vision Transformers + +> [Reversible Vision Transformers](https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf) + + + +## Introduction + +**RevViT** is initially described in [Reversible Vision Transformers](https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf), which introduces the reversible idea into the vision transformer to reduce the GPU memory footprint required for training. + + + 
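To make the memory argument concrete, here is a toy sketch of a reversible pair of residual functions; it is an illustration only, not the MMPreTrain `RevViT` backbone. Because the inputs can be recomputed exactly from the outputs, intermediate activations do not have to be cached for the backward pass and can be rebuilt on the fly.

```python
import torch
import torch.nn as nn


class ToyReversibleBlock(nn.Module):
    """y1 = x1 + F(x2); y2 = x2 + G(y1) -- invertible by construction."""

    def __init__(self, dim):
        super().__init__()
        # Stand-ins for the attention and MLP sub-blocks of a transformer layer.
        self.f = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))
        self.g = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recompute the inputs from the outputs, so activations need not be stored.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2


block = ToyReversibleBlock(64)
x1, x2 = torch.rand(1, 16, 64), torch.rand(1, 16, 64)
with torch.no_grad():
    y1, y2 = block(x1, x2)
    r1, r2 = block.inverse(y1, y2)
print(torch.allclose(r1, x1, atol=1e-6), torch.allclose(r2, x2, atol=1e-6))  # True True
```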
+ +
+ +## Abstract + +
+ +Show the paper's abstract + +
+We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory footprint from the depth of the model, Reversible Vision Transformers enable memory efficient scaling of transformer architectures. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark extensively across both model sizes and tasks of image classification, object detection and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5× at identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for resource limited training regimes. Finally, we find that the additional computational burden of recomputing activations is more than overcome for deeper models, where throughput can increase up to 3.9× over their non-reversible counterparts. +
+ +
+ +## How to use it? + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('revvit-small_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('revvit-small_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/revvit/revvit-small_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/revvit/revvit-small_3rdparty_in1k_20221213-a3a34f5c.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :----------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------: | :----------------------------------------------------------------------------------: | +| `revvit-small_3rdparty_in1k`\* | From scratch | 22.44 | 4.58 | 79.87 | 94.90 | [config](revvit-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/revvit/revvit-small_3rdparty_in1k_20221213-a3a34f5c.pth) | +| `revvit-base_3rdparty_in1k`\* | From scratch | 87.34 | 17.49 | 81.81 | 95.56 | [config](revvit-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/revvit/revvit-base_3rdparty_in1k_20221213-87a7b0a5.pth) | + +*Models with * are converted from the [official repo](https://github.com/facebookresearch/SlowFast). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@inproceedings{mangalam2022reversible, + title={Reversible Vision Transformers}, + author={Mangalam, Karttikeya and Fan, Haoqi and Li, Yanghao and Wu, Chao-Yuan and Xiong, Bo and Feichtenhofer, Christoph and Malik, Jitendra}, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + pages={10830--10840}, + year={2022} +} +``` diff --git a/configs/revvit/metafile.yml b/configs/revvit/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..842de071f1b15cc9bc65b1ff85d208b6d7131b9d --- /dev/null +++ b/configs/revvit/metafile.yml @@ -0,0 +1,48 @@ +Collections: + - Name: RevViT + Metadata: + Training Data: ImageNet-1k + Architecture: + - Vision Transformer + - Reversible + Paper: + URL: https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf + Title: Reversible Vision Transformers + README: configs/revvit/README.md + Code: + Version: v1.0.0rc5 + URL: https://github.com/open-mmlab/mmpretrain/blob/1.0.0rc5/mmcls/models/backbones/revvit.py + +Models: + - Name: revvit-small_3rdparty_in1k + Metadata: + FLOPs: 4583427072 + Parameters: 22435432 + In Collection: RevViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.87 + Top 5 Accuracy: 94.90 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/revvit/revvit-small_3rdparty_in1k_20221213-a3a34f5c.pth + Config: configs/revvit/revvit-small_8xb256_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_VIT_S.pyth + Code: https://github.com/facebookresearch/SlowFast + - Name: revvit-base_3rdparty_in1k + Metadata: + FLOPs: 17490450432 + Parameters: 87337192 + In Collection: RevViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.81 + Top 5 Accuracy: 95.56 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/revvit/revvit-base_3rdparty_in1k_20221213-87a7b0a5.pth + Config: configs/revvit/revvit-base_8xb256_in1k.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_VIT_B.pyth + Code: https://github.com/facebookresearch/SlowFast diff --git a/configs/revvit/revvit-base_8xb256_in1k.py b/configs/revvit/revvit-base_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e4fde5c9487fb675b75c824608f88ba96f27e9aa --- /dev/null +++ b/configs/revvit/revvit-base_8xb256_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/revvit/revvit-base.py', + '../_base_/datasets/imagenet_bs128_revvit_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_revvit.py', + '../_base_/default_runtime.py' +] diff --git a/configs/revvit/revvit-small_8xb256_in1k.py b/configs/revvit/revvit-small_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ec3904a3da8164f7f69c61e49d9dfee217a6b99b --- /dev/null +++ b/configs/revvit/revvit-small_8xb256_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/revvit/revvit-small.py', + '../_base_/datasets/imagenet_bs128_revvit_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_revvit.py', + '../_base_/default_runtime.py' +] diff --git a/configs/riformer/README.md b/configs/riformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6be694d1bf72fd7ba5e5bac0c99d33b9338e0893 --- /dev/null +++ b/configs/riformer/README.md @@ -0,0 +1,181 @@ +# RIFormer + +> [RIFormer: Keep Your Vision Backbone Effective But 
Removing Token Mixer](https://arxiv.org/abs/2304.05659) + + + +## Introduction + +RIFormer is a way to keep a vision backbone effective while removing token mixers in its basic building blocks. Equipped with our proposed optimization strategy, we are able to build an extremely simple vision backbone with encouraging performance, while enjoying high efficiency during inference. RIFormer shares nearly the same macro and micro design as MetaFormer, but safely removes all token mixers. The quantitative results show that our networks outperform many prevailing backbones with faster inference speed on ImageNet-1K. + 
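As a rough picture of what "removing the token mixer" means structurally, the sketch below shows a MetaFormer-style block whose token mixer is simply the identity, which is the inference-time form RIFormer targets. This is a simplified toy example, not the MMPreTrain `RIFormer` implementation; per the paper, training additionally relies on a re-parameterizable mixer and the proposed optimization strategy.

```python
import torch
import torch.nn as nn


class TokenMixerFreeBlock(nn.Module):
    """MetaFormer-style block with an identity token mixer (toy example)."""

    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = nn.Identity()  # no attention / pooling / convolution here
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):  # x: (batch, tokens, dim)
        x = x + self.token_mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x


tokens = torch.rand(1, 49, 64)
print(TokenMixerFreeBlock(64)(tokens).shape)  # torch.Size([1, 49, 64])
```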
+ +
+ +## Abstract + +
+ +Show the paper's abstract + +
+This paper studies how to keep a vision backbone effective while removing token mixers in its basic building blocks. Token mixers, as self-attention for vision transformers (ViTs), are intended to perform information communication between different spatial tokens but suffer from considerable computational cost and latency. However, directly removing them will lead to an incomplete model structure prior, and thus brings a significant accuracy drop. To this end, we first develop an RepIdentityFormer base on the re-parameterizing idea, to study the token mixer free model architecture. And we then explore the improved learning paradigm to break the limitation of simple token mixer free backbone, and summarize the empirical practice into 5 guidelines. Equipped with the proposed optimization strategy, we are able to build an extremely simple vision backbone with encouraging performance, while enjoying the high efficiency during inference. Extensive experiments and ablative analysis also demonstrate that the inductive bias of network architecture, can be incorporated into simple network structure with appropriate optimization strategy. We hope this work can serve as a starting point for the exploration of optimization-driven efficient network design. +
+ +
+ +## How to use + +The checkpoints provided are all `training-time` models. Use the reparameterize tool or `switch_to_deploy` interface to switch them to more efficient `inference-time` architecture, which not only has fewer parameters but also less calculations. + + + +**Predict image** + +Use `classifier.backbone.switch_to_deploy()` interface to switch the RIFormer models into inference mode. + +```python +>>> import torch +>>> from mmpretrain import get_model, inference_model +>>> +>>> model = get_model("riformer-s12_in1k", pretrained=True) +>>> results = inference_model(model, 'demo/demo.JPEG') +>>> print( (results['pred_class'], results['pred_score']) ) +('sea snake', 0.7827484011650085) +>>> +>>> # switch to deploy mode +>>> model.backbone.switch_to_deploy() +>>> results = inference_model(model, 'demo/demo.JPEG') +>>> print( (results['pred_class'], results['pred_score']) ) +('sea snake', 0.7827480435371399) +``` + +**Use the model** + +```python +>>> import torch +>>> +>>> model = get_model("riformer-s12_in1k", pretrained=True) +>>> model.eval() +>>> inputs = torch.rand(1, 3, 224, 224).to(model.data_preprocessor.device) +>>> # To get classification scores. +>>> out = model(inputs) +>>> print(out.shape) +torch.Size([1, 1000]) +>>> # To extract features. +>>> outs = model.extract_feat(inputs) +>>> print(outs[0].shape) +torch.Size([1, 512]) +>>> +>>> # switch to deploy mode +>>> model.backbone.switch_to_deploy() +>>> out_deploy = model(inputs) +>>> print(out.shape) +torch.Size([1, 1000]) +>>> assert torch.allclose(out, out_deploy, rtol=1e-4, atol=1e-5) # pass without error +``` + +**Test Command** + +Place the ImageNet dataset to the `data/imagenet/` directory, or prepare datasets according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +*224×224* + +Download Checkpoint: + +```shell +wget https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k_20230406-6741ce71.pth +``` + +Test use unfused model: + +```shell +python tools/test.py configs/riformer/riformer-s12_8xb128_in1k.py riformer-s12_32xb128_in1k_20230406-6741ce71.pth +``` + +Reparameterize checkpoint: + +```shell +python tools/model_converters/reparameterize_model.py configs/riformer/riformer-s12_8xb128_in1k.py riformer-s12_32xb128_in1k_20230406-6741ce71.pth riformer-s12_deploy.pth +``` + +Test use fused model: + +```shell +python tools/test.py configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py riformer-s12_deploy.pth +``` + + + +For more configurable parameters, please refer to the [API](https://mmpretrain.readthedocs.io/en/latest/api/generated/mmpretrain.models.backbones.RIFormer.html#mmpretrain.models.backbones.RIFormer). + +
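If you prefer to fuse the weights from Python rather than with the reparameterization tool described below, a rough sketch based on the `switch_to_deploy` interface shown above could look like this. It is an untested assumption of a workflow, not an official utility; the saved file would then be paired with a deploy config such as `deploy/riformer-s12-deploy_8xb128_in1k.py`.

```python
import torch
from mmpretrain import get_model

# Load the training-time (unfused) RIFormer and switch it to the deploy structure.
model = get_model('riformer-s12_in1k', pretrained=True)
model.backbone.switch_to_deploy()

# Save the fused weights as a plain state dict (hypothetical output path).
torch.save(model.state_dict(), 'riformer-s12_deploy_manual.pth')
```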
+ +How to use the reparameterization tool (click to show) + 
+ +Use the provided tool to reparameterize the given model and save the checkpoint: + +```bash +python tools/model_converters/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH} +``` + +`${CFG_PATH}` is the config file path, `${SRC_CKPT_PATH}` is the source checkpoint file path, `${TARGET_CKPT_PATH}` is the target deploy weight file path. + +For example: + +```shell +# download the weight +wget https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k_20230406-6741ce71.pth + +# reparameterize unfused weight to fused weight +python tools/model_converters/reparameterize_model.py configs/riformer/riformer-s12_8xb128_in1k.py riformer-s12_32xb128_in1k_20230406-6741ce71.pth riformer-s12_deploy.pth +``` + +To use reparameterized weights, you can use the deploy model config file such as the [s12_deploy example](./deploy/riformer-s12-deploy_8xb128_in1k.py): + +```text +# in riformer-s12-deploy_8xb128_in1k.py +_base_ = '../riformer-s12_8xb128_in1k.py' # basic s12 config + +model = dict(backbone=dict(deploy=True)) # switch model into deploy mode +``` + +```shell +python tools/test.py configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py riformer-s12_deploy.pth +``` + 
+ +
+ +## Results and models + +### ImageNet-1k + +| Model | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :-------------------: | :--------: | :-------: | :------: | :-------: | :-------: | :-------------------------------------------: | :---------------------------------------------------------------------------------------: | +| riformer-s12_in1k | 224x224 | 11.92 | 1.82 | 76.90 | 93.06 | [config](./riformer-s12_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k_20230406-6741ce71.pth) | +| riformer-s24_in1k | 224x224 | 21.39 | 3.41 | 80.28 | 94.80 | [config](./riformer-s24_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s24_32xb128_in1k_20230406-fdab072a.pth) | +| riformer-s36_in1k | 224x224 | 30.86 | 5.00 | 81.29 | 95.41 | [config](./riformer-s36_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s36_32xb128_in1k_20230406-fdfcd3b0.pth) | +| riformer-m36_in1k | 224x224 | 56.17 | 8.80 | 82.57 | 95.99 | [config](./riformer-m36_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m36_32xb128_in1k_20230406-2fcb9d9b.pth) | +| riformer-m48_in1k | 224x224 | 73.47 | 11.59 | 82.75 | 96.11 | [config](./riformer-m48_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m48_32xb128_in1k_20230406-2b9d1abf.pth) | +| riformer-s12_384_in1k | 384x384 | 11.92 | 5.36 | 78.29 | 93.93 | [config](./riformer-s12_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k-384px_20230406-145eda4c.pth) | +| riformer-s24_384_in1k | 384x384 | 21.39 | 10.03 | 81.36 | 95.40 | [config](./riformer-s24_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s24_32xb128_in1k-384px_20230406-bafae7ab.pth) | +| riformer-s36_384_in1k | 384x384 | 30.86 | 14.70 | 82.22 | 95.95 | [config](./riformer-s36_8xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s36_32xb128_in1k-384px_20230406-017ed3c4.pth) | +| riformer-m36_384_in1k | 384x384 | 56.17 | 25.87 | 83.39 | 96.40 | [config](./riformer-m36_8xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m36_32xb128_in1k-384px_20230406-66a6f764.pth) | +| riformer-m48_384_in1k | 384x384 | 73.47 | 34.06 | 83.70 | 96.60 | [config](./riformer-m48_8xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m48_32xb128_in1k-384px_20230406-2e874826.pth) | + +The config files of these models are only for inference. 
+ +## Citation + +```bibtex +@inproceedings{wang2023riformer, + title={RIFormer: Keep Your Vision Backbone Effective But Removing Token Mixer}, + author={Wang, Jiahao and Zhang, Songyang and Liu, Yong and Wu, Taiqiang and Yang, Yujiu and Liu, Xihui and Chen, Kai and Luo, Ping and Lin, Dahua}, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + year={2023} +} +``` diff --git a/configs/riformer/deploy/riformer-m36-deploy_8xb128_in1k.py b/configs/riformer/deploy/riformer-m36-deploy_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..fcec41c810849d20c080faa1a710692e4b2bb9a0 --- /dev/null +++ b/configs/riformer/deploy/riformer-m36-deploy_8xb128_in1k.py @@ -0,0 +1,3 @@ +_base_ = '../riformer-m36_8xb128_in1k.py' + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/riformer/deploy/riformer-m36-deploy_8xb64_in1k-384px.py b/configs/riformer/deploy/riformer-m36-deploy_8xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..e18f836f89d9057b1d8a1b6d31cd83d6bdca6b3a --- /dev/null +++ b/configs/riformer/deploy/riformer-m36-deploy_8xb64_in1k-384px.py @@ -0,0 +1,3 @@ +_base_ = '../riformer-m36_8xb64_in1k-384px.py' + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k-384px.py b/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..0ab33534e271ccad60a9f6d896fa15238601a4e0 --- /dev/null +++ b/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k-384px.py @@ -0,0 +1,3 @@ +_base_ = '../riformer-m48_8xb64_in1k-384px.py' + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k.py b/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e32ad328f893aaa0da1a4072315a91f514a594ce --- /dev/null +++ b/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k.py @@ -0,0 +1,3 @@ +_base_ = '../riformer-m48_8xb64_in1k.py' + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k-384px.py b/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..ffbb4be31d76716432ff283d9d7c2d77370ddbb0 --- /dev/null +++ b/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k-384px.py @@ -0,0 +1,3 @@ +_base_ = '../riformer-s12_8xb128_in1k-384px.py' + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py b/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..70fd8b74342e07ec2e3b4299364681ffbea5ec25 --- /dev/null +++ b/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py @@ -0,0 +1,3 @@ +_base_ = '../riformer-s12_8xb128_in1k.py' + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k-384px.py b/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..7d05e5c1a14afe10e05ae648e47c16d53220f226 --- /dev/null +++ b/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k-384px.py @@ -0,0 +1,3 @@ +_base_ = '../riformer-s24_8xb128_in1k-384px.py' + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k.py 
b/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..47f83a08f4f2c6fa6ffc7105265b41c12e30fd2e --- /dev/null +++ b/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k.py @@ -0,0 +1,3 @@ +_base_ = '../riformer-s24_8xb128_in1k.py' + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/riformer/deploy/riformer-s36-deploy_8xb128_in1k.py b/configs/riformer/deploy/riformer-s36-deploy_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2c03bb15106829f22ba959d2a84d0a92ceba4dac --- /dev/null +++ b/configs/riformer/deploy/riformer-s36-deploy_8xb128_in1k.py @@ -0,0 +1,3 @@ +_base_ = '../riformer-s36_8xb128_in1k.py' + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/riformer/deploy/riformer-s36-deploy_8xb64_in1k-384px.py b/configs/riformer/deploy/riformer-s36-deploy_8xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..67b17ee5173e5bef7d2ecdf6d92e09cbb48db482 --- /dev/null +++ b/configs/riformer/deploy/riformer-s36-deploy_8xb64_in1k-384px.py @@ -0,0 +1,3 @@ +_base_ = '../riformer-s36_8xb64_in1k-384px.py' + +model = dict(backbone=dict(deploy=True)) diff --git a/configs/riformer/metafile.yml b/configs/riformer/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..5f3e2ec8773d26cde570bb874d2a45a73a49bc7b --- /dev/null +++ b/configs/riformer/metafile.yml @@ -0,0 +1,152 @@ +Collections: + - Name: RIFormer + Metadata: + Training Data: ImageNet-1k + Training Resources: 8x A100 GPUs + Architecture: + - Affine + - 1x1 Convolution + - LayerScale + Paper: + URL: https://arxiv.org/abs/xxxx.xxxxx + Title: "RIFormer: Keep Your Vision Backbone Effective But Removing Token Mixer" + README: configs/riformer/README.md + Code: + Version: v1.0.0rc7 + URL: null + +Models: + - Name: riformer-s12_in1k + Metadata: + FLOPs: 1822000000 + Parameters: 11915000 + In Collection: RIFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 76.90 + Top 5 Accuracy: 93.06 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k_20230406-6741ce71.pth + Config: configs/riformer/riformer-s12_8xb128_in1k.py + - Name: riformer-s24_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 3412000000 + Parameters: 21389000 + In Collection: RIFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.28 + Top 5 Accuracy: 94.80 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s24_32xb128_in1k_20230406-fdab072a.pth + Config: configs/riformer/riformer-s24_8xb128_in1k.py + - Name: riformer-s36_in1k + Metadata: + FLOPs: 5003000000 + Parameters: 30863000 + In Collection: RIFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.29 + Top 5 Accuracy: 95.41 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s36_32xb128_in1k_20230406-fdfcd3b0.pth + Config: configs/riformer/riformer-s36_8xb128_in1k.py + - Name: riformer-m36_in1k + Metadata: + Training Data: ImageNet-1k + FLOPs: 8801000000 + Parameters: 56173000 + In Collection: RIFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.57 + Top 5 Accuracy: 95.99 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m36_32xb128_in1k_20230406-2fcb9d9b.pth + Config: configs/riformer/riformer-m36_8xb128_in1k.py + 
- Name: riformer-m48_in1k + Metadata: + FLOPs: 11590000000 + Parameters: 73473000 + In Collection: RIFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.75 + Top 5 Accuracy: 96.11 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m48_32xb128_in1k_20230406-2b9d1abf.pth + Config: configs/riformer/riformer-m48_8xb64_in1k.py + - Name: riformer-s12_in1k-384 + Metadata: + FLOPs: 5355000000 + Parameters: 11915000 + In Collection: RIFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.29 + Top 5 Accuracy: 93.93 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k-384px_20230406-145eda4c.pth + Config: configs/riformer/riformer-s12_8xb128_in1k-384px.py + - Name: riformer-s24_in1k-384 + Metadata: + Training Data: ImageNet-1k + FLOPs: 10028000000 + Parameters: 21389000 + In Collection: RIFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.36 + Top 5 Accuracy: 95.40 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s24_32xb128_in1k-384px_20230406-bafae7ab.pth + Config: configs/riformer/riformer-s24_8xb128_in1k-384px.py + - Name: riformer-s36_in1k-384 + Metadata: + FLOPs: 14702000000 + Parameters: 30863000 + In Collection: RIFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.22 + Top 5 Accuracy: 95.95 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s36_32xb128_in1k-384px_20230406-017ed3c4.pth + Config: configs/riformer/riformer-s36_8xb64_in1k-384px.py + - Name: riformer-m36_in1k-384 + Metadata: + Training Data: ImageNet-1k + FLOPs: 25865000000 + Parameters: 56173000 + In Collection: RIFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.39 + Top 5 Accuracy: 96.40 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m36_32xb128_in1k-384px_20230406-66a6f764.pth + Config: configs/riformer/riformer-m36_8xb64_in1k-384px.py + - Name: riformer-m48_in1k-384 + Metadata: + FLOPs: 34060000000 + Parameters: 73473000 + In Collection: RIFormer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.70 + Top 5 Accuracy: 96.60 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m48_32xb128_in1k-384px_20230406-2e874826.pth + Config: configs/riformer/riformer-m48_8xb64_in1k-384px.py diff --git a/configs/riformer/riformer-m36_8xb128_in1k.py b/configs/riformer/riformer-m36_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..30e93aa83d0f5c0b379367e2dc9b7a7d038108b4 --- /dev/null +++ b/configs/riformer/riformer-m36_8xb128_in1k.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs128_poolformer_medium_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RIFormer', + arch='m36', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# schedule settings 
+optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/riformer/riformer-m36_8xb64_in1k-384px.py b/configs/riformer/riformer-m36_8xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..57f687cd50b60d99978dec7baeec4bf6a67e5de5 --- /dev/null +++ b/configs/riformer/riformer-m36_8xb64_in1k-384px.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs128_riformer_medium_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RIFormer', + arch='m36', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/riformer/riformer-m48_8xb64_in1k-384px.py b/configs/riformer/riformer-m48_8xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..ef6f1964624f76e204a5d257ddee2410f21ab456 --- /dev/null +++ b/configs/riformer/riformer-m48_8xb64_in1k-384px.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs128_riformer_medium_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RIFormer', + arch='m48', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/riformer/riformer-m48_8xb64_in1k.py b/configs/riformer/riformer-m48_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9dc5c3e291f136d40633e05c9c2931d140c532bc --- /dev/null +++ b/configs/riformer/riformer-m48_8xb64_in1k.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs128_poolformer_medium_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RIFormer', + arch='m48', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/riformer/riformer-s12_8xb128_in1k-384px.py b/configs/riformer/riformer-s12_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..6d19dae07c811aeb0ca5af3cb92e57903405e49b --- /dev/null +++ b/configs/riformer/riformer-s12_8xb128_in1k-384px.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs128_riformer_small_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RIFormer', + arch='s12', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/riformer/riformer-s12_8xb128_in1k.py b/configs/riformer/riformer-s12_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e85f8fb883de19f1021b8148fc680711149b5a9d --- /dev/null +++ b/configs/riformer/riformer-s12_8xb128_in1k.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs128_poolformer_small_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RIFormer', + arch='s12', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/riformer/riformer-s24_8xb128_in1k-384px.py b/configs/riformer/riformer-s24_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..6a1ec7b57385c4910ffaebcd152296bbdee360e1 --- /dev/null +++ b/configs/riformer/riformer-s24_8xb128_in1k-384px.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs128_riformer_small_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RIFormer', + arch='s24', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/riformer/riformer-s24_8xb128_in1k.py b/configs/riformer/riformer-s24_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..560cddcf8829703d2f1e9aaf4856e947b762b49a --- /dev/null +++ b/configs/riformer/riformer-s24_8xb128_in1k.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs128_poolformer_small_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RIFormer', + arch='s24', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/riformer/riformer-s36_8xb128_in1k.py b/configs/riformer/riformer-s36_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..28511307a294031301cb425d513844780d199606 --- /dev/null +++ b/configs/riformer/riformer-s36_8xb128_in1k.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs128_poolformer_small_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RIFormer', + arch='s36', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/riformer/riformer-s36_8xb64_in1k-384px.py b/configs/riformer/riformer-s36_8xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..b3077357051632c81426e5d94322558412430373 --- /dev/null +++ b/configs/riformer/riformer-s36_8xb64_in1k-384px.py @@ -0,0 +1,39 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs128_riformer_small_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='RIFormer', + arch='s36', + drop_path_rate=0.1, + init_cfg=[ + dict( + type='TruncNormal', + layer=['Conv2d', 'Linear'], + std=.02, + bias=0.), + dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.), + ]), + neck=dict(type='GlobalAveragePooling'), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + )) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(lr=4e-3), + clip_grad=dict(max_norm=5.0), +) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/sam/README.md b/configs/sam/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1a5668a3d0bff5aacac10f26a41714afe3622c78 --- /dev/null +++ b/configs/sam/README.md @@ -0,0 +1,57 @@ +# SAM + +> [Segment Anything](https://arxiv.org/abs/2304.02643) + + + +## Abstract + +We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billionmasks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive – often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision. + +
+ +
+ +## How to use it? + + + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('vit-base-p16_sam-pre_3rdparty_sa1b-1024px', pretrained=True) +inputs = torch.rand(1, 3, 1024, 1024) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :--------------------------------------------- | :--------: | :-------: | :-------------------------------------: | :----------------------------------------------------------------------------------------------: | +| `vit-base-p16_sam-pre_3rdparty_sa1b-1024px`\* | 89.67 | 486.00 | [config](vit-base-p16_sam_headless.py) | [model](https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-base-p16_sam-pre_3rdparty_sa1b-1024px_20230411-2320f9cc.pth) | +| `vit-large-p16_sam-pre_3rdparty_sa1b-1024px`\* | 308.00 | 1494.00 | [config](vit-large-p16_sam_headless.py) | [model](https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-large-p16_sam-pre_3rdparty_sa1b-1024px_20230411-595feafd.pth) | +| `vit-huge-p16_sam-pre_3rdparty_sa1b-1024px`\* | 637.00 | 2982.00 | [config](vit-huge-p16_sam_headless.py) | [model](https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-huge-p16_sam-pre_3rdparty_sa1b-1024px_20230411-3f13c653.pth) | + +*Models with * are converted from the [official repo](https://github.com/facebookresearch/segment-anything/). The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{kirillov2023segany, + title={Segment Anything}, + author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. 
and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross}, + journal={arXiv:2304.02643}, + year={2023} +} +``` diff --git a/configs/sam/metafile.yml b/configs/sam/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..1ac65ce7715e91468e108132493ecdcbb4db277c --- /dev/null +++ b/configs/sam/metafile.yml @@ -0,0 +1,61 @@ +Collections: + - Name: SAM + Metadata: + Architecture: + - Convolution + - Dense Connections + - Dropout + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + Paper: + Title: 'Segment Anything' + URL: https://arxiv.org/abs/2304.02643 + README: configs/sam/README.md + Code: + URL: null + Version: null + +Models: + - Name: vit-base-p16_sam-pre_3rdparty_sa1b-1024px + Metadata: + FLOPs: 486000000000 + Parameters: 89671000 + Training Data: + - SA-1B + In Collection: SAM + Results: null + Weights: https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-base-p16_sam-pre_3rdparty_sa1b-1024px_20230411-2320f9cc.pth + Config: configs/sam/vit-base-p16_sam_headless.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth + Code: https://github.com/facebookresearch/segment-anything/ + + - Name: vit-large-p16_sam-pre_3rdparty_sa1b-1024px + Metadata: + FLOPs: 1494000000000 + Parameters: 308000000 + Training Data: + - SA-1B + In Collection: SAM + Results: null + Weights: https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-large-p16_sam-pre_3rdparty_sa1b-1024px_20230411-595feafd.pth + Config: configs/sam/vit-large-p16_sam_headless.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth + Code: https://github.com/facebookresearch/segment-anything/ + + - Name: vit-huge-p16_sam-pre_3rdparty_sa1b-1024px + Metadata: + FLOPs: 2982000000000 + Parameters: 637000000 + Training Data: + - SA-1B + In Collection: SAM + Results: null + Weights: https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-huge-p16_sam-pre_3rdparty_sa1b-1024px_20230411-3f13c653.pth + Config: configs/sam/vit-huge-p16_sam_headless.py + Converted From: + Weights: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth + Code: https://github.com/facebookresearch/segment-anything/ diff --git a/configs/sam/vit-base-p16_sam_headless.py b/configs/sam/vit-base-p16_sam_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..bea26376ee932af5704fd5d232efc3cdf128e310 --- /dev/null +++ b/configs/sam/vit-base-p16_sam_headless.py @@ -0,0 +1,24 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ViTSAM', + arch='base', + img_size=1024, + patch_size=16, + out_channels=256, + use_abs_pos=True, + use_rel_pos=True, + window_size=14, + ), + neck=None, + head=None, +) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/sam/vit-huge-p16_sam_headless.py b/configs/sam/vit-huge-p16_sam_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..8004755bfbe7dd0e5366297f03f73494dc27c27b --- /dev/null +++ b/configs/sam/vit-huge-p16_sam_headless.py @@ -0,0 +1,24 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ViTSAM', + arch='huge', + img_size=1024, + patch_size=16, + out_channels=256, + use_abs_pos=True, + use_rel_pos=True, + window_size=14, + ), + neck=None, + head=None, +) + +data_preprocessor = 
dict( + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/sam/vit-large-p16_sam_headless.py b/configs/sam/vit-large-p16_sam_headless.py new file mode 100644 index 0000000000000000000000000000000000000000..1cebeb098205d081a4340fb4af369e2c29a20d66 --- /dev/null +++ b/configs/sam/vit-large-p16_sam_headless.py @@ -0,0 +1,24 @@ +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ViTSAM', + arch='large', + img_size=1024, + patch_size=16, + out_channels=256, + use_abs_pos=True, + use_rel_pos=True, + window_size=14, + ), + neck=None, + head=None, +) + +data_preprocessor = dict( + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) diff --git a/configs/seresnet/README.md b/configs/seresnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b5151ccde85112f12af2170796b169933e9a93ab --- /dev/null +++ b/configs/seresnet/README.md @@ -0,0 +1,81 @@ +# SEResNet + +> [Squeeze-and-Excitation Networks](https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.html) + + + +## Abstract + +The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ~25%. + +
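Concretely, the "squeeze" is a global average pooling that summarizes each channel with a single value, and the "excitation" is a small bottleneck MLP ending in a sigmoid whose output rescales the channels. The following is a minimal illustrative sketch of an SE block, not the implementation used by the SEResNet backbone in this repository.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: pool -> FC reduce -> ReLU -> FC expand -> sigmoid -> rescale."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # (N, C, H, W) -> (N, C, 1, 1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        scale = self.excite(self.squeeze(x).view(n, c)).view(n, c, 1, 1)
        return x * scale  # channel-wise recalibration


x = torch.rand(2, 64, 56, 56)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```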
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('seresnet50_8xb32_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('seresnet50_8xb32_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/seresnet/seresnet50_8xb32_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/seresnet/seresnet50_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet50_batch256_imagenet_20200804-ae206104.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :----------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :------------------------------------------------------------------------------------------: | +| `seresnet50_8xb32_in1k` | From scratch | 28.09 | 4.13 | 77.74 | 93.84 | [config](seresnet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet50_batch256_imagenet_20200804-ae206104.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet50_batch256_imagenet_20200708-657b3c36.log.json) | +| `seresnet101_8xb32_in1k` | From scratch | 49.33 | 7.86 | 78.26 | 94.07 | [config](seresnet101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet101_batch256_imagenet_20200804-ba5b51d4.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet101_batch256_imagenet_20200708-038a4d04.log.json) | + +## Citation + +```bibtex +@inproceedings{hu2018squeeze, + title={Squeeze-and-excitation networks}, + author={Hu, Jie and Shen, Li and Sun, Gang}, + booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition}, + pages={7132--7141}, + year={2018} +} +``` diff --git a/configs/seresnet/metafile.yml b/configs/seresnet/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..1a9f116da4c8014e91e31af5db33d7b13b151826 --- /dev/null +++ b/configs/seresnet/metafile.yml @@ -0,0 +1,47 @@ +Collections: + - Name: SEResNet + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - SGD with Momentum + - Weight Decay + Training Resources: 8x V100 GPUs + Epochs: 140 + Batch Size: 256 + Architecture: + - ResNet + Paper: + URL: https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.html + Title: "Squeeze-and-Excitation Networks" + README: configs/seresnet/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/seresnet.py#L58 + Version: v0.15.0 + +Models: + - Name: seresnet50_8xb32_in1k + Metadata: + FLOPs: 4130000000 + Parameters: 28090000 + In Collection: SEResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.74 + Top 5 Accuracy: 93.84 + Task: Image Classification + Weights: 
https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet50_batch256_imagenet_20200804-ae206104.pth + Config: configs/seresnet/seresnet50_8xb32_in1k.py + - Name: seresnet101_8xb32_in1k + Metadata: + FLOPs: 7860000000 + Parameters: 49330000 + In Collection: SEResNet + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.26 + Top 5 Accuracy: 94.07 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet101_batch256_imagenet_20200804-ba5b51d4.pth + Config: configs/seresnet/seresnet101_8xb32_in1k.py diff --git a/configs/seresnet/seresnet101_8xb32_in1k.py b/configs/seresnet/seresnet101_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..8be39e7a32aa38a5c7d0b355c39a28ddff087cf1 --- /dev/null +++ b/configs/seresnet/seresnet101_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/seresnet101.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/seresnet/seresnet50_8xb32_in1k.py b/configs/seresnet/seresnet50_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..19082bd0dd6bde367a064900f5c51d730bea2923 --- /dev/null +++ b/configs/seresnet/seresnet50_8xb32_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/seresnet50.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256_140e.py', + '../_base_/default_runtime.py' +] diff --git a/configs/seresnet/seresnext101-32x4d_8xb32_in1k.py b/configs/seresnet/seresnext101-32x4d_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..01778305caf8196e73a77f39783ead80a0c3ea56 --- /dev/null +++ b/configs/seresnet/seresnext101-32x4d_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/seresnext101_32x4d.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/seresnet/seresnext50-32x4d_8xb32_in1k.py b/configs/seresnet/seresnext50-32x4d_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4d593e45b8992254f97de77fa4d157e9c31ce352 --- /dev/null +++ b/configs/seresnet/seresnext50-32x4d_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/seresnext50_32x4d.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/shufflenet_v1/README.md b/configs/shufflenet_v1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..618a22d775eae984809e4881207c0f645fc1d8c9 --- /dev/null +++ b/configs/shufflenet_v1/README.md @@ -0,0 +1,80 @@ +# Shufflenet V1 + +> [ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices](https://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html) + + + +## Abstract + +We introduce an extremely computation-efficient CNN architecture named ShuffleNet, which is designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. 
lower top-1 error (absolute 7.8%) than recent MobileNet on ImageNet classification task, under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves ~13x actual speedup over AlexNet while maintaining comparable accuracy. + +
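+Below is a small, illustrative sketch of the channel shuffle operation mentioned in the abstract: channels produced by grouped convolutions are interleaved across groups so that information can flow between them. It is for illustration only and is not the implementation used by the backbone in this repo.
+
+```python
+import torch
+
+
+def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
+    """Interleave channels across groups (illustrative sketch)."""
+    n, c, h, w = x.shape
+    assert c % groups == 0
+    # Reshape to (n, groups, channels_per_group, h, w), swap the two
+    # channel axes, then flatten back to (n, c, h, w).
+    x = x.view(n, groups, c // groups, h, w)
+    x = x.transpose(1, 2).contiguous()
+    return x.view(n, c, h, w)
+
+
+x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
+print(channel_shuffle(x, groups=4).flatten().tolist())
+# [0.0, 2.0, 4.0, 6.0, 1.0, 3.0, 5.0, 7.0]
+```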
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('shufflenet-v1-1x_16xb64_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('shufflenet-v1-1x_16xb64_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :----------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :------------------------------------------------------------------------------: | +| `shufflenet-v1-1x_16xb64_in1k` | From scratch | 1.87 | 0.15 | 68.13 | 87.81 | [config](shufflenet-v1-1x_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.json) | + +## Citation + +```bibtex +@inproceedings{zhang2018shufflenet, + title={Shufflenet: An extremely efficient convolutional neural network for mobile devices}, + author={Zhang, Xiangyu and Zhou, Xinyu and Lin, Mengxiao and Sun, Jian}, + booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition}, + pages={6848--6856}, + year={2018} +} +``` diff --git a/configs/shufflenet_v1/metafile.yml b/configs/shufflenet_v1/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..e3ca1393e629153f81791c4f584ec0ded04839e2 --- /dev/null +++ b/configs/shufflenet_v1/metafile.yml @@ -0,0 +1,35 @@ +Collections: + - Name: Shufflenet V1 + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - SGD with Momentum + - Weight Decay + - No BN decay + Training Resources: 8x 1080 GPUs + Epochs: 300 + Batch Size: 1024 + Architecture: + - Shufflenet V1 + Paper: + URL: https://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html + Title: "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices" + README: configs/shufflenet_v1/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/shufflenet_v1.py#L152 + Version: v0.15.0 + +Models: + - Name: shufflenet-v1-1x_16xb64_in1k + Metadata: + FLOPs: 146000000 + Parameters: 1870000 + In Collection: Shufflenet V1 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 68.13 + Top 5 Accuracy: 87.81 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.pth + Config: 
configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py diff --git a/configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py b/configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..58e45f1ba419f285d750d4487e40a3dbc803d8e1 --- /dev/null +++ b/configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/shufflenet_v1_1x.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py', + '../_base_/default_runtime.py' +] diff --git a/configs/shufflenet_v2/README.md b/configs/shufflenet_v2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..804aac18087ad8d1cf49c4b7c10ab36eb8128ade --- /dev/null +++ b/configs/shufflenet_v2/README.md @@ -0,0 +1,80 @@ +# Shufflenet V2 + +> [ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design](https://openaccess.thecvf.com/content_ECCV_2018/papers/Ningning_Light-weight_CNN_Architecture_ECCV_2018_paper.pdf) + + + +## Abstract + +Currently, the neural network architecture design is mostly guided by the *indirect* metric of computation complexity, i.e., FLOPs. However, the *direct* metric, e.g., speed, also depends on the other factors such as memory access cost and platform characterics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical *guidelines* for efficient network design. Accordingly, a new architecture is presented, called *ShuffleNet V2*. Comprehensive ablation experiments verify that our model is the state-of-the-art in terms of speed and accuracy tradeoff. + +
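+Since the paper's central argument is that FLOPs are only an indirect proxy and that speed should be measured directly on the target platform, the rough sketch below times the model from this page on CPU with plain PyTorch. It is only an example of how such a direct measurement could be set up; real deployment numbers depend heavily on the hardware, threading and inference engine used, and the warm-up and iteration counts are arbitrary choices.
+
+```python
+import time
+
+import torch
+from mmpretrain import get_model
+
+# Build the architecture without pretrained weights; only the speed matters here.
+model = get_model('shufflenet-v2-1x_16xb64_in1k', pretrained=False).eval()
+x = torch.rand(1, 3, 224, 224)
+
+with torch.no_grad():
+    for _ in range(10):      # warm-up iterations
+        model(x)
+    start = time.perf_counter()
+    for _ in range(50):
+        model(x)
+    elapsed = time.perf_counter() - start
+
+print(f'average CPU latency: {elapsed / 50 * 1000:.2f} ms / image')
+```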
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('shufflenet-v2-1x_16xb64_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('shufflenet-v2-1x_16xb64_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :----------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :------------------------------------------------------------------------------: | +| `shufflenet-v2-1x_16xb64_in1k` | From scratch | 2.28 | 0.15 | 69.55 | 88.92 | [config](shufflenet-v2-1x_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.json) | + +## Citation + +```bibtex +@inproceedings{ma2018shufflenet, + title={Shufflenet v2: Practical guidelines for efficient cnn architecture design}, + author={Ma, Ningning and Zhang, Xiangyu and Zheng, Hai-Tao and Sun, Jian}, + booktitle={Proceedings of the European conference on computer vision (ECCV)}, + pages={116--131}, + year={2018} +} +``` diff --git a/configs/shufflenet_v2/metafile.yml b/configs/shufflenet_v2/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..9c1eebc5e9fdb66523f719bdae1bdd38a58fea84 --- /dev/null +++ b/configs/shufflenet_v2/metafile.yml @@ -0,0 +1,35 @@ +Collections: + - Name: Shufflenet V2 + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - SGD with Momentum + - Weight Decay + - No BN decay + Training Resources: 8x 1080 GPUs + Epochs: 300 + Batch Size: 1024 + Architecture: + - Shufflenet V2 + Paper: + URL: https://openaccess.thecvf.com/content_ECCV_2018/papers/Ningning_Light-weight_CNN_Architecture_ECCV_2018_paper.pdf + Title: "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design" + README: configs/shufflenet_v2/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/shufflenet_v2.py#L134 + Version: v0.15.0 + +Models: + - Name: shufflenet-v2-1x_16xb64_in1k + Metadata: + FLOPs: 149000000 + Parameters: 2280000 + In Collection: Shufflenet V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 69.55 + Top 5 Accuracy: 88.92 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.pth + Config: configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py diff 
--git a/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py b/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a106ab8686c985a66b1c9b6af3407ef48a40c64e --- /dev/null +++ b/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/shufflenet_v2_1x.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py', + '../_base_/default_runtime.py' +] diff --git a/configs/simclr/README.md b/configs/simclr/README.md new file mode 100644 index 0000000000000000000000000000000000000000..17d0de2b79499ec47cdcb4e5eff59d362b77fced --- /dev/null +++ b/configs/simclr/README.md @@ -0,0 +1,87 @@ +# SimCLR + +> [A simple framework for contrastive learning of visual representations](https://arxiv.org/abs/2002.05709) + + + +## Abstract + +This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. + +
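+As a rough sketch of the contrastive objective described above, the snippet below computes an NT-Xent style loss for a batch of paired embeddings: the two augmented views of the same image are treated as positives and every other embedding in the batch as a negative. The batch size and embedding dimension are illustrative assumptions, the temperature of 0.1 mirrors the configs below, and this is not the exact loss implementation used by the ContrastiveHead in this repo.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def nt_xent_loss(z1, z2, temperature=0.1):
+    """NT-Xent loss over a batch of paired embeddings (illustrative sketch)."""
+    n = z1.size(0)
+    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2n, d), unit-norm rows
+    sim = z @ z.t() / temperature                 # pairwise cosine similarities
+    sim.fill_diagonal_(float('-inf'))             # exclude self-similarity
+    # The positive for sample i is its other augmented view: i <-> i + n.
+    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
+    return F.cross_entropy(sim, targets)
+
+
+z1, z2 = torch.randn(8, 128), torch.randn(8, 128)  # projector outputs of two views
+print(nt_xent_loss(z1, z2))
+```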
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('resnet50_simclr-200e-pre_8xb512-linear-coslr-90e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('simclr_resnet50_16xb256-coslr-200e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f12c0457.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :---------------------------------------- | :--------: | :-------: | :--------------------------------------------------: | :--------------------------------------------------------------------------------------: | +| `simclr_resnet50_16xb256-coslr-200e_in1k` | 27.97 | 4.11 | [config](simclr_resnet50_16xb256-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/simclr_resnet50_16xb256-coslr-200e_in1k_20220825-4d9cce50.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/simclr_resnet50_16xb256-coslr-200e_in1k_20220825-4d9cce50.json) | +| `simclr_resnet50_16xb256-coslr-800e_in1k` | 27.97 | 4.11 | [config](simclr_resnet50_16xb256-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/simclr_resnet50_16xb256-coslr-800e_in1k_20220825-85fcc4de.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/simclr_resnet50_16xb256-coslr-800e_in1k_20220825-85fcc4de.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `resnet50_simclr-200e-pre_8xb512-linear-coslr-90e_in1k` | [SIMCLR 200-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/simclr_resnet50_16xb256-coslr-200e_in1k_20220825-4d9cce50.pth) | 25.56 | 4.11 | 66.90 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f12c0457.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f12c0457.json) | +| `resnet50_simclr-800e-pre_8xb512-linear-coslr-90e_in1k` | [SIMCLR 
800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/simclr_resnet50_16xb256-coslr-800e_in1k_20220825-85fcc4de.pth) | 25.56 | 4.11 | 69.20 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-b80ae1e5.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-b80ae1e5.json) | + +## Citation + +```bibtex +@inproceedings{chen2020simple, + title={A simple framework for contrastive learning of visual representations}, + author={Chen, Ting and Kornblith, Simon and Norouzi, Mohammad and Hinton, Geoffrey}, + booktitle={ICML}, + year={2020}, +} +``` diff --git a/configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py b/configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2b5074c082b8b6fb36bd3c6711b60bab6394b4ce --- /dev/null +++ b/configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py @@ -0,0 +1,18 @@ +_base_ = [ + '../../_base_/models/resnet50.py', + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_lars_coslr_90e.py', + '../../_base_/default_runtime.py', +] + +model = dict( + backbone=dict( + frozen_stages=4, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.'))) + +# dataset summary +train_dataloader = dict(batch_size=512) + +# runtime settings +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) diff --git a/configs/simclr/metafile.yml b/configs/simclr/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..23c31ed3533160739f66731b9c02f6547910dd44 --- /dev/null +++ b/configs/simclr/metafile.yml @@ -0,0 +1,72 @@ +Collections: + - Name: SimCLR + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - LARS + Training Resources: 8x V100 GPUs (b256), 16x A100-80G GPUs (b4096) + Architecture: + - ResNet + - SimCLR + Paper: + Title: A simple framework for contrastive learning of visual representations + URL: https://arxiv.org/abs/2002.05709 + README: configs/simclr/README.md + +Models: + - Name: simclr_resnet50_16xb256-coslr-200e_in1k + Metadata: + Epochs: 200 + Batch Size: 4096 + FLOPs: 4109364224 + Parameters: 27968832 + Training Data: ImageNet-1k + In Collection: SimCLR + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/simclr_resnet50_16xb256-coslr-200e_in1k_20220825-4d9cce50.pth + Config: configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py + Downstream: + - resnet50_simclr-200e-pre_8xb512-linear-coslr-90e_in1k + - Name: simclr_resnet50_16xb256-coslr-800e_in1k + Metadata: + Epochs: 200 + Batch Size: 4096 + FLOPs: 4109364224 + Parameters: 27968832 + Training Data: ImageNet-1k + In Collection: SimCLR + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/simclr_resnet50_16xb256-coslr-800e_in1k_20220825-85fcc4de.pth + Config: configs/simclr/simclr_resnet50_16xb256-coslr-800e_in1k.py + Downstream: + - resnet50_simclr-800e-pre_8xb512-linear-coslr-90e_in1k + - Name: resnet50_simclr-200e-pre_8xb512-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 
4096 + FLOPs: 4109464576 + Parameters: 25557032 + Training Data: ImageNet-1k + In Collection: SimCLR + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 66.9 + Weights: https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f12c0457.pth + Config: configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py + - Name: resnet50_simclr-800e-pre_8xb512-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 4096 + FLOPs: 4109464576 + Parameters: 25557032 + Training Data: ImageNet-1k + In Collection: SimCLR + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 69.2 + Weights: https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-b80ae1e5.pth + Config: configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py diff --git a/configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py b/configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b48d5b31071dbb5622616b62835caa6cdd8d9589 --- /dev/null +++ b/configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py @@ -0,0 +1,46 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_simclr.py', + '../_base_/schedules/imagenet_lars_coslr_200e.py', + '../_base_/default_runtime.py', +] + +# dataset settings +train_dataloader = dict(batch_size=256) + +# model settings +model = dict( + type='SimCLR', + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='SyncBN'), + zero_init_residual=True), + neck=dict( + type='NonLinearNeck', # SimCLR non-linear neck + in_channels=2048, + hid_channels=2048, + out_channels=128, + num_layers=2, + with_avg_pool=True), + head=dict( + type='ContrastiveHead', + loss=dict(type='CrossEntropyLoss'), + temperature=0.1), +) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='LARS', lr=4.8, momentum=0.9, weight_decay=1e-6), + paramwise_cfg=dict( + custom_keys={ + 'bn': dict(decay_mult=0, lars_exclude=True), + 'bias': dict(decay_mult=0, lars_exclude=True), + # bn layer in ResNet block downsample module + 'downsample.1': dict(decay_mult=0, lars_exclude=True), + })) + +# runtime settings +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) diff --git a/configs/simclr/simclr_resnet50_16xb256-coslr-800e_in1k.py b/configs/simclr/simclr_resnet50_16xb256-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..478ef0c33418a9467d01c2a0c133be119318359c --- /dev/null +++ b/configs/simclr/simclr_resnet50_16xb256-coslr-800e_in1k.py @@ -0,0 +1,57 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_simclr.py', + '../_base_/schedules/imagenet_lars_coslr_200e.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='SimCLR', + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='SyncBN'), + zero_init_residual=True), + neck=dict( + type='NonLinearNeck', # SimCLR non-linear neck + in_channels=2048, + hid_channels=2048, + out_channels=128, + num_layers=2, + with_avg_pool=True), + head=dict( + type='ContrastiveHead', + loss=dict(type='CrossEntropyLoss'), + temperature=0.1), +) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + 
optimizer=dict(type='LARS', lr=4.8, momentum=0.9, weight_decay=1e-6), + paramwise_cfg=dict( + custom_keys={ + 'bn': dict(decay_mult=0, lars_exclude=True), + 'bias': dict(decay_mult=0, lars_exclude=True), + # bn layer in ResNet block downsample module + 'downsample.1': dict(decay_mult=0, lars_exclude=True), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', T_max=790, by_epoch=True, begin=10, end=800) +] + +# runtime settings +train_cfg = dict(max_epochs=800) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) diff --git a/configs/simclr/simclr_resnet50_8xb32-coslr-200e_in1k.py b/configs/simclr/simclr_resnet50_8xb32-coslr-200e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..36a144536e832c5e022675f3f6878d1cfa71c563 --- /dev/null +++ b/configs/simclr/simclr_resnet50_8xb32-coslr-200e_in1k.py @@ -0,0 +1,47 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_simclr.py', + '../_base_/schedules/imagenet_lars_coslr_200e.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='SimCLR', + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='SyncBN'), + zero_init_residual=True), + neck=dict( + type='NonLinearNeck', # SimCLR non-linear neck + in_channels=2048, + hid_channels=2048, + out_channels=128, + num_layers=2, + with_avg_pool=True), + head=dict( + type='ContrastiveHead', + loss=dict(type='CrossEntropyLoss'), + temperature=0.1), +) + +# optimizer +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='LARS', lr=0.3, momentum=0.9, weight_decay=1e-6), + paramwise_cfg=dict( + custom_keys={ + 'bn': dict(decay_mult=0, lars_exclude=True), + 'bias': dict(decay_mult=0, lars_exclude=True), + # bn layer in ResNet block downsample module + 'downsample.1': dict(decay_mult=0, lars_exclude=True), + })) + +# runtime settings +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=256) diff --git a/configs/simmim/README.md b/configs/simmim/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3e44b0790086ac62c5719eba3198fd531f2dab98 --- /dev/null +++ b/configs/simmim/README.md @@ -0,0 +1,90 @@ +# SimMIM + +> [SimMIM: A Simple Framework for Masked Image Modeling](https://arxiv.org/abs/2111.09886) + + + +## Abstract + +This paper presents SimMIM, a simple framework for masked image modeling. We simplify recently proposed related approaches without special designs such as blockwise masking and tokenization via discrete VAE or clustering. To study what let the masked image modeling task learn good representations, we systematically study the major components in our framework, and find that simple designs of each component have revealed very strong representation learning performance: 1) random masking of the input image with a moderately large masked patch size (e.g., 32) makes a strong pre-text task; 2) predicting raw pixels of RGB values by direct regression performs no worse than the patch classification approaches with complex designs; 3) the prediction head can be as light as a linear layer, with no worse performance than heavier ones. 
Using ViT-B, our approach achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K by pre-training also on this dataset, surpassing previous best approach by +0.6%. When applied on a larger model of about 650 million parameters, SwinV2H, it achieves 87.1% top-1 accuracy on ImageNet-1K using only ImageNet-1K data. We also leverage this approach to facilitate the training of a 3B model (SwinV2-G), that by 40× less data than that in previous practice, we achieve the state-of-the-art on four representative vision benchmarks. The code and models will be publicly available at https: //github.com/microsoft/SimMIM . + +
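+The snippet below is a minimal sketch of the recipe summarised in the abstract: randomly mask large patches of the input image and regress the raw pixels of the masked region with an L1 loss. The 192-pixel input matches the pretraining configs on this page and the 32-pixel mask patch follows the abstract, while the 0.6 mask ratio and the random tensor standing in for the prediction head's output are illustrative assumptions only.
+
+```python
+import torch
+import torch.nn.functional as F
+
+img = torch.rand(2, 3, 192, 192)   # pretraining resolution used by these configs
+mask_patch = 32                    # moderately large masked patch size
+mask_ratio = 0.6                   # assumed ratio of masked patches
+
+grid = 192 // mask_patch                                   # 6 x 6 patch grid
+mask = (torch.rand(2, 1, grid, grid) < mask_ratio).float()
+mask = F.interpolate(mask, scale_factor=mask_patch, mode='nearest')  # (2, 1, 192, 192)
+
+pred = torch.rand_like(img)        # stand-in for the lightweight prediction head's output
+# L1 reconstruction loss computed only on the masked pixels.
+loss = (F.l1_loss(pred, img, reduction='none') * mask).sum() / (mask.sum() * 3 + 1e-5)
+print(loss)
+```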
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('swin-base-w6_simmim-100e-pre_8xb256-coslr-100e_in1k-192px', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py +``` + +Test: + +```shell +python tools/test.py configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k/swin-base_ft-8xb256-coslr-100e_in1k_20220829-9cf23aa1.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :-------------------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------------: | :-------------------------------------------------------------: | +| `simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px` | 89.87 | 18.83 | [config](simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.json) | +| `simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px` | 89.87 | 18.83 | [config](simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192_20220916-a0e931ac.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192_20220916-a0e931ac.json) | +| `simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px` | 199.92 | 55.85 | [config](simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192_20220916-4ad216d3.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192_20220916-4ad216d3.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `swin-base-w6_simmim-100e-pre_8xb256-coslr-100e_in1k-192px` | [SIMMIM 
100-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.pth) | 87.75 | 11.30 | 82.70 | [config](benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k/swin-base_ft-8xb256-coslr-100e_in1k_20220829-9cf23aa1.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k/swin-base_ft-8xb256-coslr-100e_in1k_20220829-9cf23aa1.json) | +| `swin-base-w7_simmim-100e-pre_8xb256-coslr-100e_in1k` | [SIMMIM 100-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.pth) | 87.77 | 15.47 | 83.50 | [config](benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py) | N/A | +| `swin-base-w6_simmim-800e-pre_8xb256-coslr-100e_in1k-192px` | [SIMMIM 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192_20220916-a0e931ac.pth) | 87.77 | 15.47 | 83.80 | [config](benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k-224/swin-base_ft-8xb256-coslr-100e_in1k-224_20221208-155cc6e6.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k-224/swin-base_ft-8xb256-coslr-100e_in1k-224_20221208-155cc6e6.json) | +| `swin-large-w14_simmim-800e-pre_8xb256-coslr-100e_in1k` | [SIMMIM 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192_20220916-4ad216d3.pth) | 196.85 | 38.85 | 84.80 | [config](benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224_20220916-d4865790.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224_20220916-d4865790.json) | + +## Citation + +```bibtex +@inproceedings{xie2021simmim, + title={SimMIM: A Simple Framework for Masked Image Modeling}, + author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Bao, Jianmin and Yao, Zhuliang and Dai, Qi and Hu, Han}, + booktitle={International Conference on Computer Vision and Pattern Recognition (CVPR)}, + year={2022} +} +``` diff --git a/configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py b/configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py new file mode 100644 index 0000000000000000000000000000000000000000..47c4fa1ccfa42b0d6a3c7eb58f43f8250441b7f3 --- /dev/null +++ b/configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py @@ -0,0 +1,59 @@ +_base_ = [ + '../../_base_/models/swin_transformer/base_224.py', + '../../_base_/datasets/imagenet_bs256_swin_192.py', + '../../_base_/default_runtime.py' +] + +# model settings +model = dict( + 
backbone=dict( + img_size=192, + drop_path_rate=0.1, + stage_cfgs=dict(block_cfgs=dict(window_size=6)), + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.'))) + +# optimizer settings +optim_wrapper = dict( + type='AmpOptimWrapper', + optimizer=dict(type='AdamW', lr=5e-3, weight_decay=0.05), + clip_grad=dict(max_norm=5.0), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.9, + custom_keys={ + '.norm': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + '.absolute_pos_embed': dict(decay_mult=0.0), + '.relative_position_bias_table': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=2.5e-7 / 1.25e-3, + by_epoch=True, + begin=0, + end=20, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=80, + eta_min=2.5e-7 * 2048 / 512, + by_epoch=True, + begin=20, + end=100, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100) +val_cfg = dict() +test_cfg = dict() + +default_hooks = dict( + # save checkpoint per epoch. + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3), + logger=dict(type='LoggerHook', interval=100)) + +randomness = dict(seed=0) diff --git a/configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py b/configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f7325f03d6b495b9b775f4e2cc3c33a06f6af7dd --- /dev/null +++ b/configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py @@ -0,0 +1,102 @@ +_base_ = [ + '../../_base_/models/swin_transformer/base_224.py', + '../../_base_/datasets/imagenet_bs256_swin_192.py', + '../../_base_/default_runtime.py' +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=0.3333333333333333, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs') +] +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# model settings +model = dict( + backbone=dict( + img_size=224, + drop_path_rate=0.1, + stage_cfgs=dict(block_cfgs=dict(window_size=7)), + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.'))) + +# optimizer settings +optim_wrapper = dict( + type='AmpOptimWrapper', + optimizer=dict(type='AdamW', lr=5e-3, weight_decay=0.05), + clip_grad=dict(max_norm=5.0), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.9, + custom_keys={ + '.norm': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + '.absolute_pos_embed': dict(decay_mult=0.0), + '.relative_position_bias_table': dict(decay_mult=0.0) + })) + +# 
learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=2.5e-7 / 1.25e-3, + by_epoch=True, + begin=0, + end=20, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=80, + eta_min=2.5e-7 * 2048 / 512, + by_epoch=True, + begin=20, + end=100, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100) +val_cfg = dict() +test_cfg = dict() + +default_hooks = dict( + # save checkpoint per epoch. + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3), + logger=dict(type='LoggerHook', interval=100)) + +randomness = dict(seed=0) diff --git a/configs/simmim/benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py b/configs/simmim/benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a6eafd84d3c3f3224567747bcf645114286394f0 --- /dev/null +++ b/configs/simmim/benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py @@ -0,0 +1,105 @@ +_base_ = [ + '../../_base_/models/swin_transformer/base_224.py', + '../../_base_/datasets/imagenet_bs256_swin_192.py', + '../../_base_/default_runtime.py' +] + +# dataset settings +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=0.3333333333333333, + fill_color=[103.53, 116.28, 123.675], + fill_std=[57.375, 57.12, 58.395]), + dict(type='PackInputs') +] +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=256, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs') +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = val_dataloader + +# model settings +model = dict( + backbone=dict( + arch='large', + img_size=224, + drop_path_rate=0.2, + stage_cfgs=dict(block_cfgs=dict(window_size=14)), + pad_small_map=True, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + head=dict(in_channels=1536)) + +# optimizer settings +optim_wrapper = dict( + type='AmpOptimWrapper', + optimizer=dict(type='AdamW', lr=5e-3, weight_decay=0.05), + clip_grad=dict(max_norm=5.0), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.7, + custom_keys={ + '.norm': dict(decay_mult=0.0), + '.bias': dict(decay_mult=0.0), + '.absolute_pos_embed': dict(decay_mult=0.0), + '.relative_position_bias_table': dict(decay_mult=0.0) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=2.5e-7 / 1.25e-3, + by_epoch=True, + begin=0, + end=20, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=100, + eta_min=1e-6, + by_epoch=True, + begin=20, + end=100, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100) +val_cfg = dict() +test_cfg = dict() + +default_hooks = dict( + # save checkpoint per epoch. 
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3), + logger=dict(type='LoggerHook', interval=100)) + +randomness = dict(seed=0) diff --git a/configs/simmim/metafile.yml b/configs/simmim/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..19d9446c45c5f86315cc61be206430ea7bd97643 --- /dev/null +++ b/configs/simmim/metafile.yml @@ -0,0 +1,115 @@ +Collections: + - Name: SimMIM + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - AdamW + Training Resources: 16x A100 GPUs + Architecture: + - Swin + Paper: + Title: 'SimMIM: A Simple Framework for Masked Image Modeling' + URL: https://arxiv.org/abs/2111.09886 + README: configs/simmim/README.md + +Models: + - Name: simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px + Metadata: + Epochs: 100 + Batch Size: 2048 + FLOPs: 18832161792 + Parameters: 89874104 + Training Data: ImageNet-1k + In Collection: SimMIM + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.pth + Config: configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py + Downstream: + - swin-base-w6_simmim-100e-pre_8xb256-coslr-100e_in1k-192px + - swin-base-w7_simmim-100e-pre_8xb256-coslr-100e_in1k + - Name: simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px + Metadata: + Epochs: 100 + Batch Size: 2048 + FLOPs: 18832161792 + Parameters: 89874104 + Training Data: ImageNet-1k + In Collection: SimMIM + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192_20220916-a0e931ac.pth + Config: configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py + Downstream: + - swin-base-w6_simmim-800e-pre_8xb256-coslr-100e_in1k-192px + - Name: simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px + Metadata: + Epochs: 100 + Batch Size: 2048 + FLOPs: 55849130496 + Parameters: 199920372 + Training Data: ImageNet-1k + In Collection: SimMIM + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192_20220916-4ad216d3.pth + Config: configs/simmim/simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py + Downstream: + - swin-large-w14_simmim-800e-pre_8xb256-coslr-100e_in1k + - Name: swin-base-w6_simmim-100e-pre_8xb256-coslr-100e_in1k-192px + Metadata: + Epochs: 100 + Batch Size: 2048 + FLOPs: 11303976960 + Parameters: 87750176 + Training Data: ImageNet-1k + In Collection: SimMIM + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.7 + Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k/swin-base_ft-8xb256-coslr-100e_in1k_20220829-9cf23aa1.pth + Config: configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py + - Name: swin-base-w7_simmim-100e-pre_8xb256-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 2048 + FLOPs: 15466852352 + Parameters: 87768224 + Training Data: ImageNet-1k + In Collection: SimMIM + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.5 + Weights: null + Config: configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py + - Name: swin-base-w6_simmim-800e-pre_8xb256-coslr-100e_in1k-192px + Metadata: + Epochs: 100 + 
Batch Size: 2048 + FLOPs: 15466852352 + Parameters: 87768224 + Training Data: ImageNet-1k + In Collection: SimMIM + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.8 + Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k-224/swin-base_ft-8xb256-coslr-100e_in1k-224_20221208-155cc6e6.pth + Config: configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py + - Name: swin-large-w14_simmim-800e-pre_8xb256-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 2048 + FLOPs: 38853083136 + Parameters: 196848316 + Training Data: ImageNet-1k + In Collection: SimMIM + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.8 + Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224_20220916-d4865790.pth + Config: configs/simmim/benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py diff --git a/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-100e_in1k-192px.py b/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-100e_in1k-192px.py new file mode 100644 index 0000000000000000000000000000000000000000..ed9dfdb85d6ebb0e87f18257a9320bc9166f4c5e --- /dev/null +++ b/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-100e_in1k-192px.py @@ -0,0 +1,4 @@ +_base_ = 'simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py' + +# dataset 16 GPUs x 128 +train_dataloader = dict(batch_size=128) diff --git a/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py b/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py new file mode 100644 index 0000000000000000000000000000000000000000..560714b7d6a74a22f6d8bb4358a0977fc73909e8 --- /dev/null +++ b/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py @@ -0,0 +1,64 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs256_simmim_192.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='SimMIM', + backbone=dict( + type='SimMIMSwinTransformer', + arch='base', + img_size=192, + stage_cfgs=dict(block_cfgs=dict(window_size=6))), + neck=dict( + type='SimMIMLinearDecoder', in_channels=128 * 2**3, encoder_stride=32), + head=dict( + type='SimMIMHead', + patch_size=4, + loss=dict(type='PixelReconstructionLoss', criterion='L1', channel=3))) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + optimizer=dict( + type='AdamW', + lr=1e-4 * 2048 / 512, + betas=(0.9, 0.999), + weight_decay=0.05), + clip_grad=dict(max_norm=5.0), + paramwise_cfg=dict( + custom_keys={ + 'norm': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'absolute_pos_embed': dict(decay_mult=0.), + 'relative_position_bias_table': dict(decay_mult=0.) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=5e-7 / 1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='MultiStepLR', + milestones=[700], + by_epoch=True, + begin=10, + end=800, + convert_to_iter_based=True) +] + +# runtime +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py b/configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py new file mode 100644 index 0000000000000000000000000000000000000000..a0be14486a3e29b14b78e507108f57d803404b8f --- /dev/null +++ b/configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py @@ -0,0 +1,65 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs256_simmim_192.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='SimMIM', + backbone=dict( + type='SimMIMSwinTransformer', + arch='base', + img_size=192, + stage_cfgs=dict(block_cfgs=dict(window_size=6))), + neck=dict( + type='SimMIMLinearDecoder', in_channels=128 * 2**3, encoder_stride=32), + head=dict( + type='SimMIMHead', + patch_size=4, + loss=dict(type='PixelReconstructionLoss', criterion='L1', channel=3))) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + optimizer=dict( + type='AdamW', + lr=2e-4 * 2048 / 512, + betas=(0.9, 0.999), + weight_decay=0.05), + clip_grad=dict(max_norm=5.0), + paramwise_cfg=dict( + custom_keys={ + 'norm': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'absolute_pos_embed': dict(decay_mult=0.), + 'relative_position_bias_table': dict(decay_mult=0.) + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-6 / 2e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=90, + eta_min=1e-5 * 2048 / 512, + by_epoch=True, + begin=10, + end=100, + convert_to_iter_based=True) +] + +# runtime +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/simmim/simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py b/configs/simmim/simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py new file mode 100644 index 0000000000000000000000000000000000000000..0563023bd796e640c5c4caff2b9dc9bc555227c4 --- /dev/null +++ b/configs/simmim/simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py @@ -0,0 +1,65 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs256_simmim_192.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='SimMIM', + backbone=dict( + type='SimMIMSwinTransformer', + arch='large', + img_size=192, + stage_cfgs=dict(block_cfgs=dict(window_size=12)), + pad_small_map=True), + neck=dict( + type='SimMIMLinearDecoder', in_channels=192 * 2**3, encoder_stride=32), + head=dict( + type='SimMIMHead', + patch_size=4, + loss=dict(type='PixelReconstructionLoss', criterion='L1', channel=3))) + +# optimizer wrapper +optim_wrapper = dict( + type='AmpOptimWrapper', + optimizer=dict( + type='AdamW', + lr=1e-4 * 2048 / 512, + betas=(0.9, 0.999), + weight_decay=0.05), + clip_grad=dict(max_norm=5.0), + paramwise_cfg=dict( + custom_keys={ + 'norm': dict(decay_mult=0.0), + 'bias': dict(decay_mult=0.0), + 'absolute_pos_embed': dict(decay_mult=0.), + 'relative_position_bias_table': dict(decay_mult=0.) 
+ })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=5e-7 / 1e-4, + by_epoch=True, + begin=0, + end=10, + convert_to_iter_based=True), + dict( + type='MultiStepLR', + milestones=[700], + by_epoch=True, + begin=10, + end=800, + convert_to_iter_based=True) +] + +# runtime +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/simsiam/README.md b/configs/simsiam/README.md new file mode 100644 index 0000000000000000000000000000000000000000..117e45bf7bec09a86558d3372663440d5859155f --- /dev/null +++ b/configs/simsiam/README.md @@ -0,0 +1,87 @@ +# SimSiam + +> [Exploring simple siamese representation learning](https://arxiv.org/abs/2011.10566) + + + +## Abstract + +Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our “SimSiam” method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning. + +
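+The abstract's key observation is that a stop-gradient on the target branch is what prevents the two siamese branches from collapsing. The snippet below is a minimal sketch of that symmetrized negative-cosine-similarity loss; the 2048-dimensional embeddings mirror the necks in the configs below, while the batch size and the random input tensors are illustrative assumptions.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def simsiam_loss(p1, p2, z1, z2):
+    """Symmetrized negative cosine similarity with stop-gradient (sketch).
+
+    p1, p2 are predictor outputs and z1, z2 projector outputs of two views.
+    """
+    def d(p, z):
+        # Detaching z applies the stop-gradient to the target branch.
+        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
+
+    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)
+
+
+p1, p2, z1, z2 = (torch.randn(4, 2048) for _ in range(4))
+print(simsiam_loss(p1, p2, z1, z2))
+```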
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('resnet50_simsiam-100e-pre_8xb512-linear-coslr-90e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('simsiam_resnet50_8xb32-coslr-100e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f53ba400.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :--------------------------------------- | :--------: | :-------: | :-------------------------------------------------: | :----------------------------------------------------------------------------------------: | +| `simsiam_resnet50_8xb32-coslr-100e_in1k` | 38.20 | 4.11 | [config](simsiam_resnet50_8xb32-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/simsiam_resnet50_8xb32-coslr-100e_in1k_20220825-d07cb2e6.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/simsiam_resnet50_8xb32-coslr-100e_in1k_20220825-d07cb2e6.json) | +| `simsiam_resnet50_8xb32-coslr-200e_in1k` | 38.20 | 4.11 | [config](simsiam_resnet50_8xb32-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/simsiam_resnet50_8xb32-coslr-200e_in1k_20220825-efe91299.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/simsiam_resnet50_8xb32-coslr-200e_in1k_20220825-efe91299.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `resnet50_simsiam-100e-pre_8xb512-linear-coslr-90e_in1k` | [SIMSIAM 100-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/simsiam_resnet50_8xb32-coslr-100e_in1k_20220825-d07cb2e6.pth) | 25.56 | 4.11 | 68.30 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f53ba400.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f53ba400.json) | +| `resnet50_simsiam-200e-pre_8xb512-linear-coslr-90e_in1k` | [SIMSIAM 
200-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/simsiam_resnet50_8xb32-coslr-200e_in1k_20220825-efe91299.pth) | 25.56 | 4.11 | 69.80 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-519b5135.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-519b5135.json) | + +## Citation + +```bibtex +@inproceedings{chen2021exploring, + title={Exploring simple siamese representation learning}, + author={Chen, Xinlei and He, Kaiming}, + booktitle={CVPR}, + year={2021} +} +``` diff --git a/configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py b/configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2b5074c082b8b6fb36bd3c6711b60bab6394b4ce --- /dev/null +++ b/configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py @@ -0,0 +1,18 @@ +_base_ = [ + '../../_base_/models/resnet50.py', + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_lars_coslr_90e.py', + '../../_base_/default_runtime.py', +] + +model = dict( + backbone=dict( + frozen_stages=4, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.'))) + +# dataset summary +train_dataloader = dict(batch_size=512) + +# runtime settings +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) diff --git a/configs/simsiam/metafile.yml b/configs/simsiam/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..40f6706511cf6cf49f8b65153ffd575348abeeca --- /dev/null +++ b/configs/simsiam/metafile.yml @@ -0,0 +1,72 @@ +Collections: + - Name: SimSiam + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - SGD with Momentum + - Weight Decay + Training Resources: 8x V100 GPUs + Architecture: + - ResNet + Paper: + Title: Exploring simple siamese representation learning + URL: https://arxiv.org/abs/2011.10566 + README: configs/simsiam/README.md + +Models: + - Name: simsiam_resnet50_8xb32-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 256 + FLOPs: 4109364224 + Parameters: 38199360 + Training Data: ImageNet-1k + In Collection: SimSiam + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/simsiam_resnet50_8xb32-coslr-100e_in1k_20220825-d07cb2e6.pth + Config: configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py + Downstream: + - resnet50_simsiam-100e-pre_8xb512-linear-coslr-90e_in1k + - Name: simsiam_resnet50_8xb32-coslr-200e_in1k + Metadata: + Epochs: 200 + Batch Size: 256 + FLOPs: 4109364224 + Parameters: 38199360 + Training Data: ImageNet-1k + In Collection: SimSiam + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/simsiam_resnet50_8xb32-coslr-200e_in1k_20220825-efe91299.pth + Config: configs/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k.py + Downstream: + - resnet50_simsiam-200e-pre_8xb512-linear-coslr-90e_in1k + - Name: resnet50_simsiam-100e-pre_8xb512-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 4096 + FLOPs: 4109464576 + Parameters: 25557032 + Training Data: ImageNet-1k + In Collection: 
SimSiam + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 68.3 + Weights: https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f53ba400.pth + Config: configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py + - Name: resnet50_simsiam-200e-pre_8xb512-linear-coslr-90e_in1k + Metadata: + Epochs: 90 + Batch Size: 4096 + FLOPs: 4109464576 + Parameters: 25557032 + Training Data: ImageNet-1k + In Collection: SimSiam + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 69.8 + Weights: https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-519b5135.pth + Config: configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py diff --git a/configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py b/configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ad19af6acaa530f0a0c3120034fa836cec965642 --- /dev/null +++ b/configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py @@ -0,0 +1,58 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_mocov2.py', + '../_base_/schedules/imagenet_sgd_coslr_200e.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='SimSiam', + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='SyncBN'), + zero_init_residual=True), + neck=dict( + type='NonLinearNeck', + in_channels=2048, + hid_channels=2048, + out_channels=2048, + num_layers=3, + with_last_bn_affine=False, + with_avg_pool=True), + head=dict( + type='LatentPredictHead', + loss=dict(type='CosineSimilarityLoss'), + predictor=dict( + type='NonLinearNeck', + in_channels=2048, + hid_channels=512, + out_channels=2048, + with_avg_pool=False, + with_last_bn=False, + with_last_bias=True)), +) + +# optimizer +# set base learning rate +lr = 0.05 +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='SGD', lr=lr, weight_decay=1e-4, momentum=0.9), + paramwise_cfg=dict(custom_keys={'predictor': dict(fix_lr=True)})) + +# learning rate scheduler +param_scheduler = [ + dict(type='CosineAnnealingLR', T_max=100, by_epoch=True, begin=0, end=100) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100) +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) + +# additional hooks +custom_hooks = [ + dict(type='SimSiamHook', priority='HIGH', fix_pred_lr=True, lr=lr) +] diff --git a/configs/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k.py b/configs/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..fa3b2bbf5eb0b2f6c9b6907e78d189c13ea00cae --- /dev/null +++ b/configs/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k.py @@ -0,0 +1,52 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_mocov2.py', + '../_base_/schedules/imagenet_sgd_coslr_200e.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + type='SimSiam', + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='SyncBN'), + zero_init_residual=True), + neck=dict( + type='NonLinearNeck', + in_channels=2048, + hid_channels=2048, + out_channels=2048, + num_layers=3, + with_last_bn_affine=False, + with_avg_pool=True), + 
head=dict( + type='LatentPredictHead', + loss=dict(type='CosineSimilarityLoss'), + predictor=dict( + type='NonLinearNeck', + in_channels=2048, + hid_channels=512, + out_channels=2048, + with_avg_pool=False, + with_last_bn=False, + with_last_bias=True)), +) + +# optimizer +# set base learning rate +lr = 0.05 +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='SGD', lr=lr, weight_decay=1e-4, momentum=0.9), + paramwise_cfg=dict(custom_keys={'predictor': dict(fix_lr=True)})) + +# runtime settings +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) + +# additional hooks +custom_hooks = [ + dict(type='SimSiamHook', priority='HIGH', fix_pred_lr=True, lr=lr) +] diff --git a/configs/spark/README.md b/configs/spark/README.md new file mode 100644 index 0000000000000000000000000000000000000000..60f510e959dacac9fa48a5e0495be63e4fc1a03a --- /dev/null +++ b/configs/spark/README.md @@ -0,0 +1,87 @@ +# SparK + +> [Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling](https://arxiv.org/abs/2301.03580) + + + +## Abstract + +We identify and overcome two key obstacles in extending the success of BERT-style pre-training, or the masked image modeling, to convolutional networks (convnets): (i) convolution operation cannot handle irregular, random-masked input images; (ii) the single-scale nature of BERT pre-training is inconsistent with convnet's hierarchical structure. For (i), we treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode. This is the first use of sparse convolution for 2D masked modeling. For (ii), we develop a hierarchical decoder to reconstruct images from multi-scale encoded features. Our method called Sparse masKed modeling (SparK) is general: it can be used directly on any convolutional model without backbone modifications. We validate it on both classical (ResNet) and modern (ConvNeXt) models: on three downstream tasks, it surpasses both state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins (around +1.0%). Improvements on object detection and instance segmentation are more substantial (up to +3.5%), verifying the strong transferability of features learned. We also find its favorable scaling behavior by observing more gains on larger models. All this evidence reveals a promising future of generative pre-training on convnets. Codes and models are released at https://github.com/keyu-tian/SparK. + +
+ +
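To make the masking idea above concrete, the sketch below builds the random patch mask on the coarsest (32x-downsampled) grid and re-applies it, nearest-neighbour upsampled, to a feature map so that masked positions stay zero at every scale. This is a dense stand-in for what the sparse convolution guarantees, written only for illustration; the 0.6 mask ratio and 32x downsampling mirror the configs below, and none of the function names come from the MMPretrain implementation.

```python
import torch
import torch.nn.functional as F


def make_patch_mask(batch, h, w, mask_ratio=0.6, downsample=32):
    """Random keep-mask on the (H/32, W/32) grid; 1 = visible patch."""
    gh, gw = h // downsample, w // downsample
    num_patches = gh * gw
    num_keep = int(num_patches * (1 - mask_ratio))
    ids = torch.rand(batch, num_patches).argsort(dim=1)
    mask = torch.zeros(batch, num_patches)
    mask.scatter_(1, ids[:, :num_keep], 1.0)
    return mask.view(batch, 1, gh, gw)


def mask_features(feat, mask):
    """Zero out masked positions at this feature map's resolution
    (a dense approximation of masked/sparse convolution)."""
    m = F.interpolate(mask, size=feat.shape[-2:], mode='nearest')
    return feat * m


mask = make_patch_mask(batch=2, h=224, w=224)   # (2, 1, 7, 7)
c2 = torch.randn(2, 256, 56, 56)                # an early-stage ResNet feature map
print(mask_features(c2, mask).shape)            # torch.Size([2, 256, 56, 56])
```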
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('resnet50_spark-pre_300e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('spark_sparse-resnet50_800e_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/resnet50_8xb256-coslr-300e_in1k/resnet50_8xb256-coslr-300e_in1k_20230612-f86aab51.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :--------------------------------------- | :--------: | :-------: | :-------------------------------------------------------------------: | :----------------------------------------------------------------------: | +| `spark_sparse-resnet50_800e_in1k` | 37.97 | 4.10 | [config](spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k_20230612-e403c28f.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k_20230612-e403c28f.json) | +| `spark_sparse-convnextv2-tiny_800e_in1k` | 39.73 | 4.47 | [config](spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k_20230612-b0ea712e.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k_20230612-b0ea712e.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------------------ | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :-----------------------------------------: | +| `resnet50_spark-pre_300e_in1k` | [SPARK](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k_20230612-e403c28f.pth) | 23.52 | 1.31 | 80.10 | 94.90 | [config](benchmarks/resnet50_8xb256-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/resnet50_8xb256-coslr-300e_in1k/resnet50_8xb256-coslr-300e_in1k_20230612-f86aab51.pth) \| 
[log](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/resnet50_8xb256-coslr-300e_in1k/resnet50_8xb256-coslr-300e_in1k_20230612-f86aab51.json) | +| `convnextv2-tiny_spark-pre_300e_in1k` | [SPARK](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k_20230612-b0ea712e.pth) | 28.64 | 4.47 | 82.80 | 96.30 | [config](benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/spark//spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k_20230612-ffc78743.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/spark//spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k_20230612-ffc78743.json) | + +## Citation + +```bibtex +@Article{tian2023designing, + author = {Keyu Tian and Yi Jiang and Qishuai Diao and Chen Lin and Liwei Wang and Zehuan Yuan}, + title = {Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling}, + journal = {arXiv:2301.03580}, + year = {2023}, +} +``` diff --git a/configs/spark/benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py b/configs/spark/benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..95ef81f16a8d1173702ccfe3313f1e85bdd561ef --- /dev/null +++ b/configs/spark/benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py @@ -0,0 +1,122 @@ +_base_ = [ + '../../_base_/datasets/imagenet_bs64_swin_224.py', + '../../_base_/default_runtime.py', +] + +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='NumpyToPIL', to_rgb=True), + dict( + type='torchvision/TrivialAugmentWide', + num_magnitude_bins=31, + interpolation='bicubic', + fill=None), + dict(type='PILToNumpy', to_bgr=True), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +train_dataloader = dict( + dataset=dict(pipeline=train_pipeline), + sampler=dict(type='RepeatAugSampler', shuffle=True), +) + +# Model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='ConvNeXt', + arch='tiny', + drop_path_rate=0.1, + layer_scale_init_value=0., + use_grn=True, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + init_cfg=dict(type='TruncNormal', layer='Linear', std=.02, bias=0.), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +custom_hooks = [ + dict( + type='EMAHook', + momentum=1e-4, + evaluate_on_origin=True, + priority='ABOVE_NORMAL') +] + +# schedule settings +# optimizer +optim_wrapper = dict( + 
optimizer=dict( + type='AdamW', lr=3.2e-3, betas=(0.9, 0.999), weight_decay=0.05), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.7, + norm_decay_mult=0.0, + bias_decay_mult=0.0, + flat_decay_mult=0.0)) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=20, + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=280, + eta_min=1.0e-5, + by_epoch=True, + begin=20, + end=300) +] +train_cfg = dict(by_epoch=True, max_epochs=300) +val_cfg = dict() +test_cfg = dict() + +default_hooks = dict( + # only keeps the latest 2 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2)) + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py b/configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..5d7527ce2a545949a6395d847631b5c4484af398 --- /dev/null +++ b/configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py @@ -0,0 +1,107 @@ +_base_ = [ + '../../_base_/models/resnet50.py', + '../../_base_/datasets/imagenet_bs256_rsb_a12.py', + '../../_base_/default_runtime.py' +] +# modification is based on ResNets RSB settings +data_preprocessor = dict( + num_classes=1000, + # RGB format normalization parameters + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='NumpyToPIL', to_rgb=True), + dict( + type='torchvision/TrivialAugmentWide', + num_magnitude_bins=31, + interpolation='bicubic', + fill=None), + dict(type='PILToNumpy', to_bgr=True), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) + +# model settings +model = dict( + backbone=dict( + norm_cfg=dict(type='SyncBN', requires_grad=True), + drop_path_rate=0.05, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')), + head=dict( + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, use_sigmoid=True)), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.1), + dict(type='CutMix', alpha=1.0) + ])) + +# schedule settings +# optimizer +optim_wrapper = dict( + optimizer=dict( + type='Lamb', + lr=0.016, + weight_decay=0.02, + ), + constructor='LearningRateDecayOptimWrapperConstructor', + paramwise_cfg=dict( + layer_decay_rate=0.7, + norm_decay_mult=0.0, + bias_decay_mult=0.0, + flat_decay_mult=0.0)) + +# learning policy +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=0.0001, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=295, + eta_min=1.0e-6, + by_epoch=True, + begin=5, + end=300) +] +train_cfg = dict(by_epoch=True, 
max_epochs=300) +val_cfg = dict() +test_cfg = dict() + +default_hooks = dict( + # only keeps the latest 2 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2)) +# randomness +randomness = dict(seed=0, diff_rank_seed=True) + +# NOTE: `auto_scale_lr` is for automatically scaling LR, +# based on the actual training batch size. +auto_scale_lr = dict(base_batch_size=2048) diff --git a/configs/spark/metafile.yml b/configs/spark/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..81ca3a7033e7eeac1ef88a852613f4866854f625 --- /dev/null +++ b/configs/spark/metafile.yml @@ -0,0 +1,73 @@ +Collections: + - Name: SparK + Metadata: + Architecture: + - Dense Connections + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + Paper: + Title: 'Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling' + URL: https://arxiv.org/abs/2301.03580 + README: configs/spark/README.md + Code: + URL: null + Version: null + +Models: + - Name: spark_sparse-resnet50_800e_in1k + Metadata: + FLOPs: 4100000000 + Parameters: 37971000 + Training Data: + - ImageNet-1k + In Collection: SparK + Results: null + Weights: https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k_20230612-e403c28f.pth + Config: configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py + Downstream: + - resnet50_spark-pre_300e_in1k + - Name: resnet50_spark-pre_300e_in1k + Metadata: + FLOPs: 1310000000 + Parameters: 23520000 + Training Data: + - ImageNet-1k + In Collection: SparK + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.1 + Top 5 Accuracy: 94.9 + Task: Image Classification + Weights: https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/resnet50_8xb256-coslr-300e_in1k/resnet50_8xb256-coslr-300e_in1k_20230612-f86aab51.pth + Config: configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py + + - Name: spark_sparse-convnextv2-tiny_800e_in1k + Metadata: + FLOPs: 4470000000 + Parameters: 39732000 + Training Data: + - ImageNet-1k + In Collection: SparK + Results: null + Weights: https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k_20230612-b0ea712e.pth + Config: configs/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py + Downstream: + - convnextv2-tiny_spark-pre_300e_in1k + - Name: convnextv2-tiny_spark-pre_300e_in1k + Metadata: + FLOPs: 4469631744 + Parameters: 28635496 + Training Data: + - ImageNet-1k + In Collection: SparK + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.8 + Top 5 Accuracy: 96.3 + Task: Image Classification + Weights: https://download.openmmlab.com/mmpretrain/v1.0/spark//spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k_20230612-ffc78743.pth + Config: configs/spark/benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py diff --git a/configs/spark/spark_sparse-convnext-small_16xb256-amp-coslr-800e_in1k.py b/configs/spark/spark_sparse-convnext-small_16xb256-amp-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..5cefb5b93ae8bd79e501b2c6ab6b874c11751b44 --- /dev/null +++ b/configs/spark/spark_sparse-convnext-small_16xb256-amp-coslr-800e_in1k.py @@ -0,0 +1,81 @@ +_base_ = [ + 
'../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# dataset 8 x 512 +train_dataloader = dict(batch_size=256, num_workers=8) + +# model settings +model = dict( + type='SparK', + input_size=224, + downsample_raito=32, + mask_ratio=0.6, + enc_dec_norm_cfg=dict(type='SparseLN2d', eps=1e-6), + enc_dec_norm_dim=768, + backbone=dict( + type='SparseConvNeXt', + arch='small', + drop_path_rate=0.2, + out_indices=(0, 1, 2, 3), + gap_before_output=False), + neck=dict( + type='SparKLightDecoder', + feature_dim=512, + upsample_ratio=32, # equal to downsample_raito + mid_channels=0, + last_act=False), + head=dict( + type='SparKPretrainHead', + loss=dict(type='PixelReconstructionLoss', criterion='L2'))) + +# optimizer wrapper +optimizer = dict( + type='Lamb', lr=2e-4 * 4096 / 512, betas=(0.9, 0.95), weight_decay=0.04) +optim_wrapper = dict( + type='AmpOptimWrapper', + optimizer=optimizer, + clip_grad=dict(max_norm=5.0), + paramwise_cfg=dict( + bias_decay_mult=0.0, + flat_decay_mult=0.0, + custom_keys={ + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=760, + by_epoch=True, + begin=40, + end=800, + convert_to_iter_based=True), + dict( + type='CosineAnnealingWeightDecay', + eta_min=0.2, + T_max=800, + by_epoch=True, + begin=0, + end=800, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +default_hooks = dict( + logger=dict(type='LoggerHook', interval=100), + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2)) + +# randomness +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py b/configs/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..3a1afc80821abb06fcafe956d1e3c3b919ab0f20 --- /dev/null +++ b/configs/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py @@ -0,0 +1,84 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# dataset 16 x 256 +train_dataloader = dict(batch_size=256, num_workers=8) + +# model settings, use ConvNeXt V2 +model = dict( + type='SparK', + input_size=224, + downsample_raito=32, + mask_ratio=0.6, + enc_dec_norm_cfg=dict(type='SparseLN2d', eps=1e-6), + enc_dec_norm_dim=768, + backbone=dict( + type='SparseConvNeXt', + arch='tiny', + drop_path_rate=0.2, + out_indices=(0, 1, 2, 3), + gap_before_output=False, + layer_scale_init_value=0., + use_grn=True, + ), + neck=dict( + type='SparKLightDecoder', + feature_dim=512, + upsample_ratio=32, # equal to downsample_raito + mid_channels=0, + last_act=False), + head=dict( + type='SparKPretrainHead', + loss=dict(type='PixelReconstructionLoss', criterion='L2'))) + +# optimizer wrapper +optimizer = dict( + type='Lamb', lr=2e-4 * 4096 / 512, betas=(0.9, 0.95), weight_decay=0.04) +optim_wrapper = dict( + type='AmpOptimWrapper', + optimizer=optimizer, + clip_grad=dict(max_norm=5.0), + paramwise_cfg=dict( + bias_decay_mult=0.0, + flat_decay_mult=0.0, + custom_keys={ + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=20, + convert_to_iter_based=True), + 
dict( + type='CosineAnnealingLR', + T_max=780, + by_epoch=True, + begin=20, + end=800, + convert_to_iter_based=True), + dict( + type='CosineAnnealingWeightDecay', + eta_min=0.2, + T_max=800, + by_epoch=True, + begin=0, + end=800, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +default_hooks = dict( + logger=dict(type='LoggerHook', interval=100), + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2)) + +# randomness +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-1600e_in1k.py b/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-1600e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..10fc67574b705d2181f74db3d9d839a1812731e1 --- /dev/null +++ b/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-1600e_in1k.py @@ -0,0 +1,30 @@ +_base_ = 'spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py' + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=1560, + by_epoch=True, + begin=40, + end=1600, + convert_to_iter_based=True), + dict( + type='CosineAnnealingWeightDecay', + eta_min=0.2, + T_max=1600, + by_epoch=True, + begin=0, + end=1600, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(max_epochs=1600) diff --git a/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py b/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..864f616209361ba63158f64d66ffb06c2693e9e8 --- /dev/null +++ b/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py @@ -0,0 +1,80 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs512_mae.py', + '../_base_/default_runtime.py', +] + +# dataset 8 x 512 +train_dataloader = dict(batch_size=512, num_workers=8) + +# model settings +model = dict( + type='SparK', + input_size=224, + downsample_raito=32, + mask_ratio=0.6, + enc_dec_norm_cfg=dict(type='SparseSyncBatchNorm2d'), + enc_dec_norm_dim=2048, + backbone=dict( + type='SparseResNet', + depth=50, + out_indices=(0, 1, 2, 3), + drop_path_rate=0.05), + neck=dict( + type='SparKLightDecoder', + feature_dim=512, + upsample_ratio=32, # equal to downsample_raito + mid_channels=0, + last_act=False), + head=dict( + type='SparKPretrainHead', + loss=dict(type='PixelReconstructionLoss', criterion='L2'))) + +# optimizer wrapper +optimizer = dict( + type='Lamb', lr=2e-4 * 4096 / 512, betas=(0.9, 0.95), weight_decay=0.04) +optim_wrapper = dict( + type='AmpOptimWrapper', + optimizer=optimizer, + clip_grad=dict(max_norm=5.0), + paramwise_cfg=dict( + bias_decay_mult=0.0, + flat_decay_mult=0.0, + custom_keys={ + 'mask_token': dict(decay_mult=0.), + })) + +# learning rate scheduler +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=40, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=760, + by_epoch=True, + begin=40, + end=800, + convert_to_iter_based=True), + dict( + type='CosineAnnealingWeightDecay', + eta_min=0.2, + T_max=800, + by_epoch=True, + begin=0, + end=800, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800) +default_hooks = dict( + logger=dict(type='LoggerHook', interval=100), + # only keeps the latest 3 
checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2)) + +# randomness +randomness = dict(seed=0, diff_rank_seed=True) diff --git a/configs/swav/README.md b/configs/swav/README.md new file mode 100644 index 0000000000000000000000000000000000000000..fdcdfeb25e3c454d084bbf2d8a7b3d685c35c9fc --- /dev/null +++ b/configs/swav/README.md @@ -0,0 +1,85 @@ +# SwAV + +> [Unsupervised Learning of Visual Features by Contrasting Cluster Assignments](https://arxiv.org/abs/2006.09882) + + + +## Abstract + +Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or “views”) of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a “swapped” prediction mechanism where we predict the code of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements. + +
+ +
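The "swapped" prediction mechanism described above reduces to a cross-entropy between the code (soft cluster assignment) of one view and the prototype scores of the other view. The snippet below is a minimal sketch of that loss given precomputed codes; the Sinkhorn-Knopp step that produces the codes, the multi-crop views and the queue handled by `SwAVHook` are omitted, and the temperature simply mirrors the 0.1 used in the config below.

```python
import torch
import torch.nn.functional as F


def swav_swapped_loss(scores1, scores2, q1, q2, temperature=0.1):
    """Swapped prediction loss for two views.

    scores1, scores2: prototype scores (z @ C) of each view, shape (N, K).
    q1, q2: codes of each view, shape (N, K), normally computed with the
            Sinkhorn-Knopp algorithm on the scores (omitted here).
    """
    log_p1 = F.log_softmax(scores1 / temperature, dim=1)
    log_p2 = F.log_softmax(scores2 / temperature, dim=1)
    # predict the code of one view from the scores of the other view
    loss = -0.5 * (torch.sum(q2 * log_p1, dim=1).mean() +
                   torch.sum(q1 * log_p2, dim=1).mean())
    return loss


scores1, scores2 = torch.randn(8, 3000), torch.randn(8, 3000)
q1, q2 = (F.softmax(torch.randn(8, 3000), dim=1) for _ in range(2))  # stand-in codes
print(swav_swapped_loss(scores1, scores2, q1, q2))
```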
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('resnet50_swav-pre_8xb32-linear-coslr-100e_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py +``` + +Test: + +```shell +python tools/test.py configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-80341e08.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :----------------------------------------------------- | :--------: | :-------: | :------------------------------------------------------------: | :---------------------------------------------------------------: | +| `swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px` | 28.35 | 4.11 | [config](swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96_20220825-5b3fc7fc.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96_20220825-5b3fc7fc.json) | + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: | +| `resnet50_swav-pre_8xb32-linear-coslr-100e_in1k` | [SWAV](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96_20220825-5b3fc7fc.pth) | 25.56 | 4.11 | 70.50 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-80341e08.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-80341e08.json) | + +## Citation + +```bibtex +@article{caron2020unsupervised, + title={Unsupervised Learning of Visual Features by Contrasting Cluster Assignments}, + author={Caron, Mathilde and Misra, Ishan and Mairal, Julien and Goyal, Priya and Bojanowski, Piotr and Joulin, Armand}, + booktitle={NeurIPS}, + year={2020} +} +``` diff --git a/configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py 
b/configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2b5074c082b8b6fb36bd3c6711b60bab6394b4ce --- /dev/null +++ b/configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py @@ -0,0 +1,18 @@ +_base_ = [ + '../../_base_/models/resnet50.py', + '../../_base_/datasets/imagenet_bs32_pil_resize.py', + '../../_base_/schedules/imagenet_lars_coslr_90e.py', + '../../_base_/default_runtime.py', +] + +model = dict( + backbone=dict( + frozen_stages=4, + init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.'))) + +# dataset summary +train_dataloader = dict(batch_size=512) + +# runtime settings +default_hooks = dict( + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) diff --git a/configs/swav/metafile.yml b/configs/swav/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..5bc1252ad1ed6528d28847b728b85f3e91e7d0b9 --- /dev/null +++ b/configs/swav/metafile.yml @@ -0,0 +1,44 @@ +Collections: + - Name: SwAV + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - LARS + Training Resources: 8x V100 GPUs + Architecture: + - ResNet + - SwAV + Paper: + Title: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments + URL: https://arxiv.org/abs/2006.09882 + README: configs/swav/README.md + +Models: + - Name: swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px + Metadata: + Epochs: 200 + Batch Size: 256 + FLOPs: 4109364224 + Parameters: 28354752 + Training Data: ImageNet-1k + In Collection: SwAV + Results: null + Weights: https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96_20220825-5b3fc7fc.pth + Config: configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py + Downstream: + - resnet50_swav-pre_8xb32-linear-coslr-100e_in1k + - Name: resnet50_swav-pre_8xb32-linear-coslr-100e_in1k + Metadata: + Epochs: 100 + Batch Size: 256 + FLOPs: 4109464576 + Parameters: 25557032 + Training Data: ImageNet-1k + In Collection: SwAV + Results: + - Task: Image Classification + Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 70.5 + Weights: https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-80341e08.pth + Config: configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py diff --git a/configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py b/configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py new file mode 100644 index 0000000000000000000000000000000000000000..ebb9ead92ef84387aa8715c013be36eebb661dd8 --- /dev/null +++ b/configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py @@ -0,0 +1,159 @@ +_base_ = [ + '../_base_/schedules/imagenet_lars_coslr_200e.py', + '../_base_/default_runtime.py', +] + +# dataset settings +dataset_type = 'ImageNet' +data_root = 'data/imagenet/' +data_preprocessor = dict( + type='SelfSupDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + to_rgb=True) + +num_crops = [2, 6] +color_distort_strength = 1.0 +view_pipeline1 = [ + dict( + type='RandomResizedCrop', + scale=224, + crop_ratio_range=(0.14, 1.), + backend='pillow'), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.8 * color_distort_strength, + contrast=0.8 * color_distort_strength, + saturation=0.8 * 
color_distort_strength, + hue=0.2 * color_distort_strength) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=0.5), + dict(type='RandomFlip', prob=0.5), +] +view_pipeline2 = [ + dict( + type='RandomResizedCrop', + scale=96, + crop_ratio_range=(0.05, 0.14), + backend='pillow'), + dict( + type='RandomApply', + transforms=[ + dict( + type='ColorJitter', + brightness=0.8 * color_distort_strength, + contrast=0.8 * color_distort_strength, + saturation=0.8 * color_distort_strength, + hue=0.2 * color_distort_strength) + ], + prob=0.8), + dict( + type='RandomGrayscale', + prob=0.2, + keep_channels=True, + channel_weights=(0.114, 0.587, 0.2989)), + dict( + type='GaussianBlur', + magnitude_range=(0.1, 2.0), + magnitude_std='inf', + prob=0.5), + dict(type='RandomFlip', prob=0.5), +] +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='MultiView', + num_views=num_crops, + transforms=[view_pipeline1, view_pipeline2]), + dict(type='PackInputs') +] + +batch_size = 32 +train_dataloader = dict( + batch_size=batch_size, + num_workers=8, + drop_last=True, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + collate_fn=dict(type='default_collate'), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='meta/train.txt', + data_prefix=dict(img_path='train/'), + pipeline=train_pipeline)) + +# model settings +model = dict( + type='SwAV', + data_preprocessor=dict( + mean=(123.675, 116.28, 103.53), + std=(58.395, 57.12, 57.375), + to_rgb=True), + backbone=dict( + type='ResNet', + depth=50, + norm_cfg=dict(type='SyncBN'), + zero_init_residual=True), + neck=dict( + type='SwAVNeck', + in_channels=2048, + hid_channels=2048, + out_channels=128, + with_avg_pool=True), + head=dict( + type='SwAVHead', + loss=dict( + type='SwAVLoss', + feat_dim=128, # equal to neck['out_channels'] + epsilon=0.05, + temperature=0.1, + num_crops=num_crops, + ))) + +# optimizer +optim_wrapper = dict(type='OptimWrapper', optimizer=dict(type='LARS', lr=0.6)) +find_unused_parameters = True + +# learning policy +param_scheduler = [ + dict( + type='CosineAnnealingLR', + T_max=200, + eta_min=6e-4, + by_epoch=True, + begin=0, + end=200, + convert_to_iter_based=True) +] + +# runtime settings +default_hooks = dict( + # only keeps the latest 3 checkpoints + checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3)) + +# additional hooks +custom_hooks = [ + dict( + type='SwAVHook', + priority='VERY_HIGH', + batch_size=batch_size, + epoch_queue_starts=15, + crops_for_assign=[0, 1], + feat_dim=128, + queue_length=3840, + frozen_layers_cfg=dict(prototypes=5005)) +] diff --git a/configs/swin_transformer/README.md b/configs/swin_transformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1d41f13a52554d7dd5896d284cd22b47b6b1fc8a --- /dev/null +++ b/configs/swin_transformer/README.md @@ -0,0 +1,111 @@ +# Swin-Transformer + +> [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) + + + +## Introduction + +**Swin Transformer** (the name **Swin** stands for Shifted window) is initially described in [the paper](https://arxiv.org/pdf/2103.14030.pdf), which capably serves as a general-purpose backbone for computer vision. It is basically a hierarchical Transformer whose representation is computed with shifted windows. 
The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. + +Swin Transformer achieves strong performance on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val), surpassing previous models by a large margin. + +
+ +
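The shifted-window scheme in the introduction rests on two small tensor operations: splitting the feature map into non-overlapping windows so that self-attention is computed locally, and cyclically rolling the map by half a window in alternate blocks so that information can cross window borders. Below is a minimal sketch with illustrative shapes (Swin-T stage 1: 56x56 tokens, 96 channels, window size 7); it is not the MMPretrain backbone code.

```python
import torch


def window_partition(x, window_size=7):
    """(B, H, W, C) -> (B * num_windows, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)


x = torch.randn(1, 56, 56, 96)       # stage-1 feature map of Swin-T
windows = window_partition(x)        # (64, 7, 7, 96): attention runs per window
# shifted-window block: roll by half a window before partitioning,
# and roll back after attention (not shown)
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted)
print(windows.shape, shifted_windows.shape)
```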
+ +## Abstract + +
+This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with **Shifted windows**. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. +
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('swin-tiny_16xb64_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('swin-tiny_16xb64_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/swin_transformer/swin-tiny_16xb64_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/swin_transformer/swin-tiny_16xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :----------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :------------------------------------------------------------------: | +| `swin-tiny_16xb64_in1k` | From scratch | 28.29 | 4.36 | 81.18 | 95.61 | [config](swin-tiny_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925.json) | +| `swin-small_16xb64_in1k` | From scratch | 49.61 | 8.52 | 83.02 | 96.29 | [config](swin-small_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219-7f9d988b.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219.json) | +| `swin-base_16xb64_in1k` | From scratch | 87.77 | 15.14 | 83.36 | 96.44 | [config](swin-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742-93230b0d.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742.json) | +| `swin-tiny_3rdparty_in1k`\* | From scratch | 28.29 | 4.36 | 81.18 | 95.52 | [config](swin-tiny_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_tiny_patch4_window7_224-160bb0a5.pth) | +| `swin-small_3rdparty_in1k`\* | From scratch | 49.61 | 8.52 | 83.21 | 96.25 | [config](swin-small_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_small_patch4_window7_224-cc7a01c9.pth) | +| `swin-base_3rdparty_in1k`\* | From scratch | 87.77 | 15.14 | 83.42 | 96.44 | [config](swin-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window7_224-4670dd19.pth) | +| `swin-base_3rdparty_in1k-384`\* | From scratch | 87.90 | 44.49 | 84.49 | 96.95 | [config](swin-base_16xb64_in1k-384px.py) | 
[model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window12_384-02c598a4.pth) | +| `swin-base_in21k-pre-3rdparty_in1k`\* | From scratch | 87.77 | 15.14 | 85.16 | 97.50 | [config](swin-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window7_224_22kto1k-f967f799.pth) | +| `swin-base_in21k-pre-3rdparty_in1k-384`\* | From scratch | 87.90 | 44.49 | 86.44 | 98.05 | [config](swin-base_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window12_384_22kto1k-d59b0d1d.pth) | +| `swin-large_in21k-pre-3rdparty_in1k`\* | From scratch | 196.53 | 34.04 | 86.24 | 97.88 | [config](swin-large_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_large_patch4_window7_224_22kto1k-5f0996db.pth) | +| `swin-large_in21k-pre-3rdparty_in1k-384`\* | From scratch | 196.74 | 100.04 | 87.25 | 98.25 | [config](swin-large_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_large_patch4_window12_384_22kto1k-0a40944b.pth) | + +*Models with * are converted from the [official repo](https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458). The config files of these models are only for inference. We haven't reproduce the training results.* + +### Image Classification on CUB-200-2011 + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download | +| :-------------------------- | :----------: | :--------: | :-------: | :-------: | :------------------------------------: | :---------------------------------------------------------------------------------------------: | +| `swin-large_8xb8_cub-384px` | From scratch | 195.51 | 100.04 | 91.87 | [config](swin-large_8xb8_cub-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin-large_8xb8_cub_384px_20220307-1bbaee6a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin-large_8xb8_cub_384px_20220307-1bbaee6a.json) | + +## Citation + +```bibtex +@article{liu2021Swin, + title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows}, + author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining}, + journal={arXiv preprint arXiv:2103.14030}, + year={2021} +} +``` diff --git a/configs/swin_transformer/metafile.yml b/configs/swin_transformer/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..8bff599267afe52a0904c106be4fcd8c76f6e4bf --- /dev/null +++ b/configs/swin_transformer/metafile.yml @@ -0,0 +1,201 @@ +Collections: + - Name: Swin-Transformer + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - AdamW + - Weight Decay + Training Resources: 16x V100 GPUs + Epochs: 300 + Batch Size: 1024 + Architecture: + - Shift Window Multihead Self Attention + Paper: + URL: https://arxiv.org/abs/2103.14030 + Title: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" + README: configs/swin_transformer/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/swin_transformer.py#L176 + Version: v0.15.0 + +Models: + - Name: swin-tiny_16xb64_in1k + Metadata: + FLOPs: 4360000000 + Parameters: 28290000 + In Collection: Swin-Transformer + Results: + - Dataset: 
ImageNet-1k + Metrics: + Top 1 Accuracy: 81.18 + Top 5 Accuracy: 95.61 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth + Config: configs/swin_transformer/swin-tiny_16xb64_in1k.py + - Name: swin-small_16xb64_in1k + Metadata: + FLOPs: 8520000000 + Parameters: 49610000 + In Collection: Swin-Transformer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.02 + Top 5 Accuracy: 96.29 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219-7f9d988b.pth + Config: configs/swin_transformer/swin-small_16xb64_in1k.py + - Name: swin-base_16xb64_in1k + Metadata: + FLOPs: 15140000000 + Parameters: 87770000 + In Collection: Swin-Transformer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.36 + Top 5 Accuracy: 96.44 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742-93230b0d.pth + Config: configs/swin_transformer/swin-base_16xb64_in1k.py + - Name: swin-tiny_3rdparty_in1k + Metadata: + FLOPs: 4360000000 + Parameters: 28290000 + In Collection: Swin-Transformer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.18 + Top 5 Accuracy: 95.52 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_tiny_patch4_window7_224-160bb0a5.pth + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth + Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458 + Config: configs/swin_transformer/swin-tiny_16xb64_in1k.py + - Name: swin-small_3rdparty_in1k + Metadata: + FLOPs: 8520000000 + Parameters: 49610000 + In Collection: Swin-Transformer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.21 + Top 5 Accuracy: 96.25 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_small_patch4_window7_224-cc7a01c9.pth + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_small_patch4_window7_224.pth + Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458 + Config: configs/swin_transformer/swin-small_16xb64_in1k.py + - Name: swin-base_3rdparty_in1k + Metadata: + FLOPs: 15140000000 + Parameters: 87770000 + In Collection: Swin-Transformer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.42 + Top 5 Accuracy: 96.44 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window7_224-4670dd19.pth + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window7_224.pth + Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458 + Config: configs/swin_transformer/swin-base_16xb64_in1k.py + - Name: swin-base_3rdparty_in1k-384 + Metadata: + FLOPs: 44490000000 + Parameters: 87900000 + In Collection: Swin-Transformer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.49 + Top 5 Accuracy: 96.95 + Task: Image Classification 
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window12_384-02c598a4.pth + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window12_384.pth + Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458 + Config: configs/swin_transformer/swin-base_16xb64_in1k-384px.py + - Name: swin-base_in21k-pre-3rdparty_in1k + Metadata: + FLOPs: 15140000000 + Parameters: 87770000 + In Collection: Swin-Transformer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.16 + Top 5 Accuracy: 97.50 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window7_224_22kto1k-f967f799.pth + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window7_224_22kto1k.pth + Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458 + Config: configs/swin_transformer/swin-base_16xb64_in1k.py + - Name: swin-base_in21k-pre-3rdparty_in1k-384 + Metadata: + FLOPs: 44490000000 + Parameters: 87900000 + In Collection: Swin-Transformer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.44 + Top 5 Accuracy: 98.05 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window12_384_22kto1k-d59b0d1d.pth + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window12_384_22kto1k.pth + Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458 + Config: configs/swin_transformer/swin-base_16xb64_in1k-384px.py + - Name: swin-large_in21k-pre-3rdparty_in1k + Metadata: + FLOPs: 34040000000 + Parameters: 196530000 + In Collection: Swin-Transformer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.24 + Top 5 Accuracy: 97.88 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_large_patch4_window7_224_22kto1k-5f0996db.pth + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window7_224_22kto1k.pth + Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458 + Config: configs/swin_transformer/swin-large_16xb64_in1k.py + - Name: swin-large_in21k-pre-3rdparty_in1k-384 + Metadata: + FLOPs: 100040000000 + Parameters: 196740000 + In Collection: Swin-Transformer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 87.25 + Top 5 Accuracy: 98.25 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_large_patch4_window12_384_22kto1k-0a40944b.pth + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window12_384_22kto1k.pth + Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458 + Config: configs/swin_transformer/swin-large_16xb64_in1k-384px.py + - Name: swin-large_8xb8_cub-384px + Metadata: + FLOPs: 100040000000 + Parameters: 195510000 + In Collection: Swin-Transformer + Results: + - Dataset: 
CUB-200-2011 + Metrics: + Top 1 Accuracy: 91.87 + Task: Image Classification + Pretrain: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin-large_3rdparty_in21k-384px.pth + Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin-large_8xb8_cub_384px_20220307-1bbaee6a.pth + Config: configs/swin_transformer/swin-large_8xb8_cub-384px.py diff --git a/configs/swin_transformer/swin-base_16xb64_in1k-384px.py b/configs/swin_transformer/swin-base_16xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..10f89921ff1ec6659509ccdee8e15cfe52395880 --- /dev/null +++ b/configs/swin_transformer/swin-base_16xb64_in1k-384px.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/swin_transformer/base_384.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/swin_transformer/swin-base_16xb64_in1k.py b/configs/swin_transformer/swin-base_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..05a95b4483dd3764abbcf9e32b1291334e084099 --- /dev/null +++ b/configs/swin_transformer/swin-base_16xb64_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/swin_transformer/base_224.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/swin_transformer/swin-large_16xb64_in1k-384px.py b/configs/swin_transformer/swin-large_16xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..5ba52b3564704acfeb2c40eb39e1d4e5cf5bf573 --- /dev/null +++ b/configs/swin_transformer/swin-large_16xb64_in1k-384px.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/swin_transformer/large_384.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/swin_transformer/swin-large_16xb64_in1k.py b/configs/swin_transformer/swin-large_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..36121efca15f951a03d153b614d3e844cc8cad26 --- /dev/null +++ b/configs/swin_transformer/swin-large_16xb64_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/swin_transformer/large_224.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/swin_transformer/swin-large_8xb8_cub-384px.py b/configs/swin_transformer/swin-large_8xb8_cub-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..a2f10a6a292bc2485085a38c895b635a5944d04c --- /dev/null +++ b/configs/swin_transformer/swin-large_8xb8_cub-384px.py @@ -0,0 +1,40 @@ +_base_ = [ + '../_base_/models/swin_transformer/large_384.py', + '../_base_/datasets/cub_bs8_384.py', + '../_base_/schedules/cub_bs64.py', + '../_base_/default_runtime.py', +] + +# model settings +checkpoint = 'https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin-large_3rdparty_in21k-384px.pth' # noqa +model = dict( + type='ImageClassifier', + backbone=dict( + init_cfg=dict( + type='Pretrained', 
checkpoint=checkpoint, prefix='backbone')), + head=dict(num_classes=200, )) + +# schedule settings +optim_wrapper = dict( + optimizer=dict( + _delete_=True, + type='AdamW', + lr=5e-6, + weight_decay=0.0005, + eps=1e-8, + betas=(0.9, 0.999)), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.absolute_pos_embed': dict(decay_mult=0.0), + '.relative_position_bias_table': dict(decay_mult=0.0) + }), + clip_grad=dict(max_norm=5.0), +) + +default_hooks = dict( + # log every 20 iterations + logger=dict(type='LoggerHook', interval=20), + # save last three checkpoints + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) diff --git a/configs/swin_transformer/swin-small_16xb64_in1k.py b/configs/swin_transformer/swin-small_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..7c1a8e21a7f2cbc881cbde43c19af9cd10b7c2ba --- /dev/null +++ b/configs/swin_transformer/swin-small_16xb64_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/swin_transformer/small_224.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/swin_transformer/swin-tiny_16xb64_in1k.py b/configs/swin_transformer/swin-tiny_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9a1ce2508ab603b008640583de78c64d2f178620 --- /dev/null +++ b/configs/swin_transformer/swin-tiny_16xb64_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/swin_transformer/tiny_224.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/swin_transformer_v2/README.md b/configs/swin_transformer_v2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..dd20548ae780ebca6cf0cc982ea71c782e369b52 --- /dev/null +++ b/configs/swin_transformer_v2/README.md @@ -0,0 +1,121 @@ +# Swin-Transformer V2 + +> [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) + + + +## Introduction + +**Swin Transformer V2** is a work on scaling up vision models, built on [Swin Transformer](https://github.com/open-mmlab/mmpretrain/tree/main/configs/swin_transformer). In the vision field, performance cannot be improved by simply scaling up the model the way NLP models are scaled. The possible reasons mentioned in the paper are: + +- Training instability when scaling up the model +- The resolution gap when transferring a model pre-trained at low resolution to higher-resolution tasks +- Excessive GPU memory consumption + +To address these issues, the paper proposes the following improvements (toy sketches of two of them follow below): + +- Post normalization: apply layer normalization after the self-attention layer and the MLP block +- Scaled cosine attention: use cosine similarity to compute the attention between token pairs +- Log-spaced continuous position bias: redefine the relative position encoding + +
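For intuition, here is a minimal PyTorch sketch of the scaled cosine attention idea; the tensor shapes, the temperature initialization and the function name are illustrative assumptions, not the MMPreTrain implementation:

```python
# Toy sketch of scaled cosine attention: the attention logits are the cosine
# similarity between queries and keys divided by a learnable per-head
# temperature `tau`, instead of dot products scaled by 1/sqrt(d).
import torch
import torch.nn.functional as F


def scaled_cosine_attention(q, k, v, tau):
    """q, k, v: (batch, heads, tokens, dim); tau: (heads, 1, 1) learnable temperature."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) / tau.clamp(min=0.01)
    return attn.softmax(dim=-1) @ v


q = k = v = torch.rand(2, 4, 49, 32)                # e.g. tokens of a 7x7 window
tau = torch.full((4, 1, 1), 0.5)                    # one temperature per head
print(scaled_cosine_attention(q, k, v, tau).shape)  # torch.Size([2, 4, 49, 32])
```

Because every similarity lies in [-1, 1] before the temperature is applied, the attention logits stay bounded as the model grows, which is what stabilizes training.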
+ +
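A similarly hedged sketch of the residual post-norm block (a plain `nn.MultiheadAttention` stands in for the windowed scaled cosine attention, and the dimensions are made up for illustration):

```python
# Residual post-norm (Swin V2) vs. the usual pre-norm (Swin V1): LayerNorm is
# applied to the sub-layer output before it is added back to the residual
# stream, keeping activation magnitudes of deep layers under control.
import torch
import torch.nn as nn


class PostNormBlock(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.norm1(self.attn(x, x, x)[0])  # norm after attention
        x = x + self.norm2(self.mlp(x))            # norm after the MLP
        return x


block = PostNormBlock(dim=96, num_heads=3)
print(block(torch.rand(2, 49, 96)).shape)          # torch.Size([2, 49, 96])
```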
+ +## Abstract + +
+ +Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google's billion-level visual models, which consumes 40 times less labelled data and 40 times less training time. + +
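To make the position-bias point from the abstract concrete, the following is a hedged sketch of log-spaced relative coordinates; the exact normalization constants differ in the official implementation, and `log_spaced_coords` is a name made up for this illustration:

```python
# Relative token offsets are mapped to log space before being fed to the small
# MLP that produces the continuous position bias, so transferring from a small
# pre-training window to a larger fine-tuning window only requires a modest
# extrapolation of the input coordinates.
import torch


def log_spaced_coords(window_size, pretrained_window_size):
    coords = torch.arange(window_size, dtype=torch.float32)
    rel = coords[None, :] - coords[:, None]            # offsets in [-(w-1), w-1]
    rel = rel / max(pretrained_window_size - 1, 1)     # normalize by the pre-training window
    return torch.sign(rel) * torch.log1p(rel.abs())    # log-spaced mapping


print(log_spaced_coords(8, 8).abs().max())    # tensor(0.6931), i.e. log(2)
print(log_spaced_coords(16, 8).abs().max())   # ~1.15: doubling the window grows the range only logarithmically
```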
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('swinv2-tiny-w8_3rdparty_in1k-256px', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('swinv2-tiny-w8_3rdparty_in1k-256px', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w8_3rdparty_in1k-256px_20220803-e318968f.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :---------------------------------------- | :--------: | :-------: | :----------------------------------------------: | :------------------------------------------------------------------------------------------: | +| `swinv2-base-w12_3rdparty_in21k-192px`\* | 87.92 | 8.51 | [config](swinv2-base-w12_8xb128_in21k-192px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-base-w12_3rdparty_in21k-192px_20220803-f7dc9763.pth) | +| `swinv2-large-w12_3rdparty_in21k-192px`\* | 196.74 | 19.04 | [config](swinv2-large-w12_8xb128_in21k-192px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-large-w12_3rdparty_in21k-192px_20220803-d9073fee.pth) | + +*Models with * are converted from the [official repo](https://github.com/microsoft/Swin-Transformer). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------------: | :--------------------------------------------------: | +| `swinv2-tiny-w8_3rdparty_in1k-256px`\* | From scratch | 28.35 | 4.35 | 81.76 | 95.87 | [config](swinv2-tiny-w8_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w8_3rdparty_in1k-256px_20220803-e318968f.pth) | +| `swinv2-tiny-w16_3rdparty_in1k-256px`\* | From scratch | 28.35 | 4.40 | 82.81 | 96.23 | [config](swinv2-tiny-w16_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w16_3rdparty_in1k-256px_20220803-9651cdd7.pth) | +| `swinv2-small-w8_3rdparty_in1k-256px`\* | From scratch | 49.73 | 8.45 | 83.74 | 96.60 | [config](swinv2-small-w8_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w8_3rdparty_in1k-256px_20220803-b01a4332.pth) | +| `swinv2-small-w16_3rdparty_in1k-256px`\* | From scratch | 49.73 | 8.57 | 84.13 | 96.83 | [config](swinv2-small-w16_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w16_3rdparty_in1k-256px_20220803-b707d206.pth) | +| `swinv2-base-w8_3rdparty_in1k-256px`\* | From scratch | 87.92 | 14.99 | 84.20 | 96.86 | [config](swinv2-base-w8_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w8_3rdparty_in1k-256px_20220803-8ff28f2b.pth) | +| `swinv2-base-w16_3rdparty_in1k-256px`\* | From scratch | 87.92 | 15.14 | 84.60 | 97.05 | [config](swinv2-base-w16_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_3rdparty_in1k-256px_20220803-5a1886b7.pth) | +| `swinv2-base-w16_in21k-pre_3rdparty_in1k-256px`\* | ImageNet-21k | 87.92 | 15.14 | 86.17 | 97.88 | [config](swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_in21k-pre_3rdparty_in1k-256px_20220803-8d7aa8ad.pth) | +| `swinv2-base-w24_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 87.92 | 34.07 | 87.14 | 98.23 | [config](swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w24_in21k-pre_3rdparty_in1k-384px_20220803-44eb70f8.pth) | +| `swinv2-large-w16_in21k-pre_3rdparty_in1k-256px`\* | ImageNet-21k | 196.75 | 33.86 | 86.93 | 98.06 | [config](swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w16_in21k-pre_3rdparty_in1k-256px_20220803-c40cbed7.pth) | +| `swinv2-large-w24_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 196.75 | 76.20 | 87.59 | 98.27 | [config](swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w24_in21k-pre_3rdparty_in1k-384px_20220803-3b36c165.pth) | + +*Models with * are converted from the [official repo](https://github.com/microsoft/Swin-Transformer). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{https://doi.org/10.48550/arxiv.2111.09883, + doi = {10.48550/ARXIV.2111.09883}, + url = {https://arxiv.org/abs/2111.09883}, + author = {Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, Yue and Zhang, Zheng and Dong, Li and Wei, Furu and Guo, Baining}, + keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences}, + title = {Swin Transformer V2: Scaling Up Capacity and Resolution}, + publisher = {arXiv}, + year = {2021}, + copyright = {Creative Commons Attribution 4.0 International} +} +``` diff --git a/configs/swin_transformer_v2/metafile.yml b/configs/swin_transformer_v2/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..55a14cbab587f037d96583d3b0210ac3008b1118 --- /dev/null +++ b/configs/swin_transformer_v2/metafile.yml @@ -0,0 +1,206 @@ +Collections: + - Name: Swin-Transformer V2 + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - AdamW + - Weight Decay + Training Resources: 16x V100 GPUs + Epochs: 300 + Batch Size: 1024 + Architecture: + - Shift Window Multihead Self Attention + Paper: + URL: https://arxiv.org/abs/2111.09883 + Title: "Swin Transformer V2: Scaling Up Capacity and Resolution" + README: configs/swin_transformer_v2/README.md + +Models: + - Name: swinv2-tiny-w8_3rdparty_in1k-256px + Metadata: + FLOPs: 4350000000 + Parameters: 28350000 + In Collection: Swin-Transformer V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.76 + Top 5 Accuracy: 95.87 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w8_3rdparty_in1k-256px_20220803-e318968f.pth + Config: configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_tiny_patch4_window8_256.pth + Code: https://github.com/microsoft/Swin-Transformer + - Name: swinv2-tiny-w16_3rdparty_in1k-256px + Metadata: + FLOPs: 4400000000 + Parameters: 28350000 + In Collection: Swin-Transformer V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.81 + Top 5 Accuracy: 96.23 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w16_3rdparty_in1k-256px_20220803-9651cdd7.pth + Config: configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_tiny_patch4_window16_256.pth + Code: https://github.com/microsoft/Swin-Transformer + - Name: swinv2-small-w8_3rdparty_in1k-256px + Metadata: + FLOPs: 8450000000 + Parameters: 49730000 + In Collection: Swin-Transformer V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.74 + Top 5 Accuracy: 96.6 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w8_3rdparty_in1k-256px_20220803-b01a4332.pth + Config: configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_small_patch4_window8_256.pth + Code: https://github.com/microsoft/Swin-Transformer + - Name: swinv2-small-w16_3rdparty_in1k-256px + Metadata: + FLOPs: 8570000000 + Parameters: 49730000 + In Collection: Swin-Transformer V2 + Results: + - Dataset: 
ImageNet-1k + Metrics: + Top 1 Accuracy: 84.13 + Top 5 Accuracy: 96.83 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w16_3rdparty_in1k-256px_20220803-b707d206.pth + Config: configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_small_patch4_window16_256.pth + Code: https://github.com/microsoft/Swin-Transformer + - Name: swinv2-base-w8_3rdparty_in1k-256px + Metadata: + FLOPs: 14990000000 + Parameters: 87920000 + In Collection: Swin-Transformer V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.2 + Top 5 Accuracy: 96.86 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w8_3rdparty_in1k-256px_20220803-8ff28f2b.pth + Config: configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window8_256.pth + Code: https://github.com/microsoft/Swin-Transformer + - Name: swinv2-base-w16_3rdparty_in1k-256px + Metadata: + FLOPs: 15140000000 + Parameters: 87920000 + In Collection: Swin-Transformer V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.6 + Top 5 Accuracy: 97.05 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_3rdparty_in1k-256px_20220803-5a1886b7.pth + Config: configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window16_256.pth + Code: https://github.com/microsoft/Swin-Transformer + - Name: swinv2-base-w16_in21k-pre_3rdparty_in1k-256px + Metadata: + Training Data: ImageNet-21k + FLOPs: 15140000000 + Parameters: 87920000 + In Collection: Swin-Transformer V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.17 + Top 5 Accuracy: 97.88 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_in21k-pre_3rdparty_in1k-256px_20220803-8d7aa8ad.pth + Config: configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12to16_192to256_22kto1k_ft.pth + Code: https://github.com/microsoft/Swin-Transformer + - Name: swinv2-base-w24_in21k-pre_3rdparty_in1k-384px + Metadata: + Training Data: ImageNet-21k + FLOPs: 34070000000 + Parameters: 87920000 + In Collection: Swin-Transformer V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 87.14 + Top 5 Accuracy: 98.23 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w24_in21k-pre_3rdparty_in1k-384px_20220803-44eb70f8.pth + Config: configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12to24_192to384_22kto1k_ft.pth + Code: https://github.com/microsoft/Swin-Transformer + - Name: swinv2-large-w16_in21k-pre_3rdparty_in1k-256px + Metadata: + Training Data: ImageNet-21k + FLOPs: 33860000000 + Parameters: 196750000 + In Collection: Swin-Transformer V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.93 + Top 5 Accuracy: 98.06 + Task: Image 
Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w16_in21k-pre_3rdparty_in1k-256px_20220803-c40cbed7.pth + Config: configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12to16_192to256_22kto1k_ft.pth + Code: https://github.com/microsoft/Swin-Transformer + - Name: swinv2-large-w24_in21k-pre_3rdparty_in1k-384px + Metadata: + Training Data: ImageNet-21k + FLOPs: 76200000000 + Parameters: 196750000 + In Collection: Swin-Transformer V2 + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 87.59 + Top 5 Accuracy: 98.27 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w24_in21k-pre_3rdparty_in1k-384px_20220803-3b36c165.pth + Config: configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12to24_192to384_22kto1k_ft.pth + Code: https://github.com/microsoft/Swin-Transformer + - Name: swinv2-base-w12_3rdparty_in21k-192px + Metadata: + Training Data: ImageNet-21k + FLOPs: 8510000000 + Parameters: 87920000 + In Collection: Swin-Transformer V2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-base-w12_3rdparty_in21k-192px_20220803-f7dc9763.pth + Config: configs/swin_transformer_v2/swinv2-base-w12_8xb128_in21k-192px.py + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12_192_22k.pth + Code: https://github.com/microsoft/Swin-Transformer + - Name: swinv2-large-w12_3rdparty_in21k-192px + Metadata: + Training Data: ImageNet-21k + FLOPs: 19040000000 + Parameters: 196740000 + In Collection: Swin-Transformer V2 + Results: null + Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-large-w12_3rdparty_in21k-192px_20220803-d9073fee.pth + Config: configs/swin_transformer_v2/swinv2-large-w12_8xb128_in21k-192px.py + Converted From: + Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12_192_22k.pth + Code: https://github.com/microsoft/Swin-Transformer diff --git a/configs/swin_transformer_v2/swinv2-base-w12_8xb128_in21k-192px.py b/configs/swin_transformer_v2/swinv2-base-w12_8xb128_in21k-192px.py new file mode 100644 index 0000000000000000000000000000000000000000..9b01b75d296dae9db97d2d85f73463f6c87c0b1c --- /dev/null +++ b/configs/swin_transformer_v2/swinv2-base-w12_8xb128_in21k-192px.py @@ -0,0 +1,19 @@ +_base_ = [ + '../_base_/models/swin_transformer_v2/base_256.py', + '../_base_/datasets/imagenet21k_bs128.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(img_size=192, window_size=[12, 12, 12, 6]), + head=dict(num_classes=21841), +) + +# dataset settings +data_preprocessor = dict(num_classes=21841) + +_base_['train_pipeline'][1]['scale'] = 192 # RandomResizedCrop +_base_['test_pipeline'][1]['scale'] = 219 # ResizeEdge +_base_['test_pipeline'][2]['crop_size'] = 192 # CenterCrop diff --git a/configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py new file mode 100644 index 
0000000000000000000000000000000000000000..5f375ee1fc9b10885f8b9d9f4794b8530c1460b5 --- /dev/null +++ b/configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py @@ -0,0 +1,8 @@ +_base_ = [ + '../_base_/models/swin_transformer_v2/base_256.py', + '../_base_/datasets/imagenet_bs64_swin_256.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +model = dict(backbone=dict(window_size=[16, 16, 16, 8])) diff --git a/configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py new file mode 100644 index 0000000000000000000000000000000000000000..0725f9e739a099551a4d5b5f007bcb83708be309 --- /dev/null +++ b/configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py @@ -0,0 +1,13 @@ +_base_ = [ + '../_base_/models/swin_transformer_v2/base_256.py', + '../_base_/datasets/imagenet_bs64_swin_256.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +model = dict( + type='ImageClassifier', + backbone=dict( + window_size=[16, 16, 16, 8], + drop_path_rate=0.2, + pretrained_window_sizes=[12, 12, 12, 6])) diff --git a/configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py b/configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..3dd4e5fd935a356d29e7790e91d4538c94711062 --- /dev/null +++ b/configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py @@ -0,0 +1,14 @@ +_base_ = [ + '../_base_/models/swin_transformer_v2/base_384.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +model = dict( + type='ImageClassifier', + backbone=dict( + img_size=384, + window_size=[24, 24, 24, 12], + drop_path_rate=0.2, + pretrained_window_sizes=[12, 12, 12, 6])) diff --git a/configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py new file mode 100644 index 0000000000000000000000000000000000000000..23fc40701470f8e41252c274072896d1cd811f28 --- /dev/null +++ b/configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/swin_transformer_v2/base_256.py', + '../_base_/datasets/imagenet_bs64_swin_256.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] diff --git a/configs/swin_transformer_v2/swinv2-large-w12_8xb128_in21k-192px.py b/configs/swin_transformer_v2/swinv2-large-w12_8xb128_in21k-192px.py new file mode 100644 index 0000000000000000000000000000000000000000..9b01b75d296dae9db97d2d85f73463f6c87c0b1c --- /dev/null +++ b/configs/swin_transformer_v2/swinv2-large-w12_8xb128_in21k-192px.py @@ -0,0 +1,19 @@ +_base_ = [ + '../_base_/models/swin_transformer_v2/base_256.py', + '../_base_/datasets/imagenet21k_bs128.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# model settings +model = dict( + backbone=dict(img_size=192, window_size=[12, 12, 12, 6]), + head=dict(num_classes=21841), +) + +# dataset settings +data_preprocessor = dict(num_classes=21841) + +_base_['train_pipeline'][1]['scale'] = 192 # RandomResizedCrop +_base_['test_pipeline'][1]['scale'] = 219 # ResizeEdge +_base_['test_pipeline'][2]['crop_size'] = 192 # CenterCrop diff --git 
a/configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py new file mode 100644 index 0000000000000000000000000000000000000000..62a2a29b843f197c15d8f53a7cbd1029be675fa8 --- /dev/null +++ b/configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py @@ -0,0 +1,13 @@ +# Only for evaluation +_base_ = [ + '../_base_/models/swin_transformer_v2/large_256.py', + '../_base_/datasets/imagenet_bs64_swin_256.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +model = dict( + type='ImageClassifier', + backbone=dict( + window_size=[16, 16, 16, 8], pretrained_window_sizes=[12, 12, 12, 6]), +) diff --git a/configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py b/configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..d97d9b2b869c1e0c264910859b6f980387a7b6ab --- /dev/null +++ b/configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py @@ -0,0 +1,15 @@ +# Only for evaluation +_base_ = [ + '../_base_/models/swin_transformer_v2/large_384.py', + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +model = dict( + type='ImageClassifier', + backbone=dict( + img_size=384, + window_size=[24, 24, 24, 12], + pretrained_window_sizes=[12, 12, 12, 6]), +) diff --git a/configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py new file mode 100644 index 0000000000000000000000000000000000000000..f87265dd199c712a6442407db852b5d4b6aabd7d --- /dev/null +++ b/configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py @@ -0,0 +1,8 @@ +_base_ = [ + '../_base_/models/swin_transformer_v2/small_256.py', + '../_base_/datasets/imagenet_bs64_swin_256.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +model = dict(backbone=dict(window_size=[16, 16, 16, 8])) diff --git a/configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py new file mode 100644 index 0000000000000000000000000000000000000000..f1001f1b6e1978c3706ca6183f863c316b13ade4 --- /dev/null +++ b/configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/swin_transformer_v2/small_256.py', + '../_base_/datasets/imagenet_bs64_swin_256.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] diff --git a/configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py new file mode 100644 index 0000000000000000000000000000000000000000..7e1f290f371e1b9084f4cd5291e1e638d0ad54e3 --- /dev/null +++ b/configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py @@ -0,0 +1,8 @@ +_base_ = [ + '../_base_/models/swin_transformer_v2/tiny_256.py', + '../_base_/datasets/imagenet_bs64_swin_256.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +model = dict(backbone=dict(window_size=[16, 16, 16, 8])) diff --git a/configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py new file mode 100644 index 
0000000000000000000000000000000000000000..2cdc9a25ae8a64758f8642c079e1ff7fbf0548c3 --- /dev/null +++ b/configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/swin_transformer_v2/tiny_256.py', + '../_base_/datasets/imagenet_bs64_swin_256.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] diff --git a/configs/t2t_vit/README.md b/configs/t2t_vit/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bf0967cf27f606788174bc9fc2198cad3dbfced6 --- /dev/null +++ b/configs/t2t_vit/README.md @@ -0,0 +1,81 @@ +# Tokens-to-Token ViT + +> [Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet](https://arxiv.org/abs/2101.11986) + + + +## Abstract + +Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3% top1 accuracy in image resolution 384×384 on ImageNet. + +
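The layer-wise Tokens-to-Token transformation described in the abstract can be illustrated with a short, hypothetical PyTorch snippet (the shapes and the 3x3/stride-2 soft split are assumptions for illustration, not the exact MMPreTrain module):

```python
# One T2T step: tokens are re-assembled into an image-like grid and neighboring
# tokens are aggregated with a soft split (unfold), which models local structure
# and shortens the token sequence at the same time.
import torch
import torch.nn as nn

tokens = torch.rand(1, 56 * 56, 64)                    # (B, N, C) tokens on a 56x56 grid
b, n, c = tokens.shape
side = int(n ** 0.5)
feature_map = tokens.transpose(1, 2).reshape(b, c, side, side)
soft_split = nn.Unfold(kernel_size=3, stride=2, padding=1)
new_tokens = soft_split(feature_map).transpose(1, 2)   # concatenates each 3x3 neighborhood
print(new_tokens.shape)                                # torch.Size([1, 784, 576]) -> 4x fewer tokens
```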
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('t2t-vit-t-14_8xb64_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('t2t-vit-t-14_8xb64_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: | +| `t2t-vit-t-14_8xb64_in1k` | From scratch | 21.47 | 4.34 | 81.83 | 95.84 | [config](t2t-vit-t-14_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.json) | +| `t2t-vit-t-19_8xb64_in1k` | From scratch | 39.08 | 7.80 | 82.63 | 96.18 | [config](t2t-vit-t-19_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-19_8xb64_in1k_20211214-7f5e3aaf.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-19_8xb64_in1k_20211214-7f5e3aaf.json) | +| `t2t-vit-t-24_8xb64_in1k` | From scratch | 64.00 | 12.69 | 82.71 | 96.09 | [config](t2t-vit-t-24_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-24_8xb64_in1k_20211214-b2a68ae3.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-24_8xb64_in1k_20211214-b2a68ae3.json) | + +## Citation + +```bibtex +@article{yuan2021tokens, + title={Tokens-to-token vit: Training vision transformers from scratch on imagenet}, + author={Yuan, Li and Chen, Yunpeng and Wang, Tao and Yu, Weihao and Shi, Yujun and Tay, Francis EH and Feng, Jiashi and Yan, Shuicheng}, + journal={arXiv preprint arXiv:2101.11986}, + year={2021} +} +``` diff --git a/configs/t2t_vit/metafile.yml b/configs/t2t_vit/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..72cb2dfc92899779846af6263a125d028d17d1b2 --- /dev/null +++ b/configs/t2t_vit/metafile.yml @@ -0,0 +1,58 @@ +Collections: + - Name: Tokens-to-Token ViT + Metadata: + Training Data: ImageNet-1k + Architecture: + - Layer Normalization + - Scaled Dot-Product Attention + - Attention Dropout + - Dropout + - Tokens to Token + Paper: + URL: https://arxiv.org/abs/2101.11986 + Title: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet" + README: configs/t2t_vit/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.17.0/mmcls/models/backbones/t2t_vit.py + Version: v0.17.0 + +Models: + 
- Name: t2t-vit-t-14_8xb64_in1k + Metadata: + FLOPs: 4340000000 + Parameters: 21470000 + In Collection: Tokens-to-Token ViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.83 + Top 5 Accuracy: 95.84 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.pth + Config: configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py + - Name: t2t-vit-t-19_8xb64_in1k + Metadata: + FLOPs: 7800000000 + Parameters: 39080000 + In Collection: Tokens-to-Token ViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.63 + Top 5 Accuracy: 96.18 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-19_8xb64_in1k_20211214-7f5e3aaf.pth + Config: configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py + - Name: t2t-vit-t-24_8xb64_in1k + Metadata: + FLOPs: 12690000000 + Parameters: 64000000 + In Collection: Tokens-to-Token ViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.71 + Top 5 Accuracy: 96.09 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-24_8xb64_in1k_20211214-b2a68ae3.pth + Config: configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py diff --git a/configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py b/configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..8ff6444548c4be59f52bc2aa259e7aaac32dea3d --- /dev/null +++ b/configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py @@ -0,0 +1,49 @@ +_base_ = [ + '../_base_/models/t2t-vit-t-14.py', + '../_base_/datasets/imagenet_bs64_t2t_224.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict( + optimizer=dict(type='AdamW', lr=5e-4, weight_decay=0.05), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={'cls_token': dict(decay_mult=0.0)}, + ), +) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-6, + by_epoch=True, + begin=0, + end=10, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=290, + eta_min=1e-5, + by_epoch=True, + begin=10, + end=300), + # cool down learning rate scheduler + dict(type='ConstantLR', factor=0.1, by_epoch=True, begin=300, end=310), +] + +train_cfg = dict(by_epoch=True, max_epochs=310, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# runtime settings +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (8 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=512) diff --git a/configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py b/configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0c7275372f904a4d53453b37bb50bfd31edb842f --- /dev/null +++ b/configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py @@ -0,0 +1,49 @@ +_base_ = [ + '../_base_/models/t2t-vit-t-19.py', + '../_base_/datasets/imagenet_bs64_t2t_224.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict( + optimizer=dict(type='AdamW', lr=5e-4, weight_decay=0.065), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={'cls_token': dict(decay_mult=0.0)}, + ), +) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-6, + by_epoch=True, + begin=0, + end=10, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=290, + eta_min=1e-5, + by_epoch=True, + begin=10, + end=300), + # cool down learning rate scheduler + dict(type='ConstantLR', factor=0.1, by_epoch=True, begin=300, end=310), +] + +train_cfg = dict(by_epoch=True, max_epochs=310, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# runtime settings +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (8 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=512) diff --git a/configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py b/configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e180ff344bd88808e635f3004704c6079a03465b --- /dev/null +++ b/configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py @@ -0,0 +1,49 @@ +_base_ = [ + '../_base_/models/t2t-vit-t-24.py', + '../_base_/datasets/imagenet_bs64_t2t_224.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict( + optimizer=dict(type='AdamW', lr=5e-4, weight_decay=0.065), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={'cls_token': dict(decay_mult=0.0)}, + ), +) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-6, + by_epoch=True, + begin=0, + end=10, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=290, + eta_min=1e-5, + by_epoch=True, + begin=10, + end=300), + # cool down learning rate scheduler + dict(type='ConstantLR', factor=0.1, by_epoch=True, begin=300, end=310), +] + +train_cfg = dict(by_epoch=True, max_epochs=310, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# runtime settings +custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (8 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=512) diff --git a/configs/tinyvit/README.md b/configs/tinyvit/README.md new file mode 100644 index 0000000000000000000000000000000000000000..58ceb5779b474a9818843cec0d34e8fc8f178f4b --- /dev/null +++ b/configs/tinyvit/README.md @@ -0,0 +1,82 @@ +# TinyViT + +> [TinyViT: Fast Pretraining Distillation for Small Vision Transformers](https://arxiv.org/abs/2207.10666) + + + +## Abstract + +Vision transformer (ViT) recently has drawn great attention in computer vision due to its remarkable model capability. However, most prevailing ViT models suffer from huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of tiny and efficient small vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling small models to get the dividends of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored in disk in advance to save the memory cost and computation overheads. The tiny student transformers are automatically scaled down from a large pretrained model with computation and parameter constraints. Comprehensive experiments demonstrate the efficacy of TinyViT. It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, being comparable to SwinB pretrained on ImageNet-21k while using 4.2 times fewer parameters. Moreover, increasing image resolutions, TinyViT can reach 86.5% accuracy, being slightly better than Swin-L while using only 11% parameters. Last but not the least, we demonstrate a good transfer ability of TinyViT on various downstream tasks. + +
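As a rough illustration of the fast-distillation idea from the abstract (storing sparsified teacher logits offline), here is a hypothetical sketch; the function names, `k`, and the fill value are assumptions, not TinyViT's actual pipeline:

```python
# Keep only the top-k teacher logits per image so the large teacher never has
# to be re-run during student training; an approximate dense vector is rebuilt
# when computing the distillation loss.
import torch


@torch.no_grad()
def sparsify_teacher_logits(logits, k=10):
    values, indices = logits.topk(k, dim=-1)   # (B, k) values and class indices
    return values, indices                     # small enough to cache on disk


def densify(values, indices, num_classes, fill=-1e4):
    dense = torch.full((values.size(0), num_classes), fill)
    return dense.scatter(-1, indices, values)  # non-stored classes get near-zero probability


teacher_logits = torch.randn(4, 1000)
vals, idx = sparsify_teacher_logits(teacher_logits)
print(densify(vals, idx, num_classes=1000).shape)  # torch.Size([4, 1000])
```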
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('tinyvit-5m_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('tinyvit-5m_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/tinyvit/tinyvit-5m_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_3rdparty_in1k_20221021-62cb5abf.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :--------------------------------------------- | :------------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------------: | :------------------------------------------------: | +| `tinyvit-5m_3rdparty_in1k`\* | From scratch | 5.39 | 1.29 | 79.02 | 94.74 | [config](tinyvit-5m_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_3rdparty_in1k_20221021-62cb5abf.pth) | +| `tinyvit-5m_in21k-distill-pre_3rdparty_in1k`\* | ImageNet-21k DISTILL | 5.39 | 1.29 | 80.71 | 95.57 | [config](tinyvit-5m-distill_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_in21k-distill-pre_3rdparty_in1k_20221021-d4b010a8.pth) | +| `tinyvit-11m_3rdparty_in1k`\* | From scratch | 11.00 | 2.05 | 81.44 | 95.79 | [config](tinyvit-11m_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-11m_3rdparty_in1k_20221021-11ccef16.pth) | +| `tinyvit-11m_in21k-distill-pre_3rdparty_in1k`\* | ImageNet-21k DISTILL | 11.00 | 2.05 | 83.19 | 96.53 | [config](tinyvit-11m-distill_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-11m_in21k-distill-pre_3rdparty_in1k_20221021-5d3bc0dc.pth) | +| `tinyvit-21m_3rdparty_in1k`\* | From scratch | 21.20 | 4.30 | 83.08 | 96.58 | [config](tinyvit-21m_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_3rdparty_in1k_20221021-5346ba34.pth) | +| `tinyvit-21m_in21k-distill-pre_3rdparty_in1k`\* | ImageNet-21k DISTILL | 21.20 | 4.30 | 84.85 | 97.27 | [config](tinyvit-21m-distill_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k_20221021-3d9b30a2.pth) | +| `tinyvit-21m_in21k-distill-pre_3rdparty_in1k-384px`\* | ImageNet-21k DISTILL | 21.23 | 13.85 | 86.21 | 97.77 | [config](tinyvit-21m-distill_8xb256_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k-384px_20221021-65be6b3f.pth) | +| `tinyvit-21m_in21k-distill-pre_3rdparty_in1k-512px`\* | ImageNet-21k DISTILL | 21.27 | 27.15 | 86.44 | 97.89 | [config](tinyvit-21m-distill_8xb256_in1k-512px.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k-512px_20221021-e42a9bea.pth) | + +*Models with * are converted from the [official 
repo](https://github.com/microsoft/Cream/tree/main/TinyViT). The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@InProceedings{tiny_vit, + title={TinyViT: Fast Pretraining Distillation for Small Vision Transformers}, + author={Wu, Kan and Zhang, Jinnian and Peng, Houwen and Liu, Mengchen and Xiao, Bin and Fu, Jianlong and Yuan, Lu}, + booktitle={European conference on computer vision (ECCV)}, + year={2022} +} +``` diff --git a/configs/tinyvit/metafile.yml b/configs/tinyvit/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..a1c5438acb9eba87f7a5e8c02356459c1194d74a --- /dev/null +++ b/configs/tinyvit/metafile.yml @@ -0,0 +1,162 @@ +Collections: + - Name: TinyViT + Metadata: + Training Data: ImageNet-1k + Architecture: + - MBConv + - Window Multi-head Self-Attention + Paper: + Title: 'TinyViT: Fast Pretraining Distillation for Small Vision Transformers' + URL: https://arxiv.org/abs/2207.10666 + README: configs/tinyvit/README.md + Code: + Version: v1.0.0rc1 + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.23.2/mmcls/models/backbones/tinyvit.py + +Models: + - Name: tinyvit-5m_3rdparty_in1k + Metadata: + FLOPs: 1286655360 + Parameters: 5392764 + Training Data: ImageNet-1k + In Collection: TinyViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.02 + Top 5 Accuracy: 94.74 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_3rdparty_in1k_20221021-62cb5abf.pth + Config: configs/tinyvit/tinyvit-5m_8xb256_in1k.py + Converted From: + Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_5m_1k.pth + Code: https://github.com/microsoft/Cream/tree/main/TinyViT + - Name: tinyvit-5m_in21k-distill-pre_3rdparty_in1k + Metadata: + FLOPs: 1286655360 + Parameters: 5392764 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: TinyViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.71 + Top 5 Accuracy: 95.57 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_in21k-distill-pre_3rdparty_in1k_20221021-d4b010a8.pth + Config: configs/tinyvit/tinyvit-5m-distill_8xb256_in1k.py + Converted From: + Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_5m_22kto1k_distill.pth + Code: https://github.com/microsoft/Cream/tree/main/TinyViT + - Name: tinyvit-11m_3rdparty_in1k + Metadata: + FLOPs: 2050033664 + Parameters: 10996972 + Training Data: ImageNet-1k + In Collection: TinyViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.44 + Top 5 Accuracy: 95.79 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-11m_3rdparty_in1k_20221021-11ccef16.pth + Config: configs/tinyvit/tinyvit-11m_8xb256_in1k.py + Converted From: + Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_11m_1k.pth + Code: https://github.com/microsoft/Cream/tree/main/TinyViT + - Name: tinyvit-11m_in21k-distill-pre_3rdparty_in1k + Metadata: + FLOPs: 2050033664 + Parameters: 10996972 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: TinyViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.19 + Top 5 Accuracy: 96.53 + Task: Image Classification + Weights: 
https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-11m_in21k-distill-pre_3rdparty_in1k_20221021-5d3bc0dc.pth + Config: configs/tinyvit/tinyvit-11m-distill_8xb256_in1k.py + Converted From: + Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_11m_22kto1k_distill.pth + Code: https://github.com/microsoft/Cream/tree/main/TinyViT + - Name: tinyvit-21m_3rdparty_in1k + Metadata: + FLOPs: 4301124096 + Parameters: 21198568 + Training Data: ImageNet-1k + In Collection: TinyViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.08 + Top 5 Accuracy: 96.58 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_3rdparty_in1k_20221021-5346ba34.pth + Config: configs/tinyvit/tinyvit-21m_8xb256_in1k.py + Converted From: + Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_21m_1k.pth + Code: https://github.com/microsoft/Cream/tree/main/TinyViT + - Name: tinyvit-21m_in21k-distill-pre_3rdparty_in1k + Metadata: + FLOPs: 4301124096 + Parameters: 21198568 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: TinyViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.85 + Top 5 Accuracy: 97.27 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k_20221021-3d9b30a2.pth + Config: configs/tinyvit/tinyvit-21m-distill_8xb256_in1k.py + Converted From: + Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_21m_22kto1k_distill.pth + Code: https://github.com/microsoft/Cream/tree/main/TinyViT + - Name: tinyvit-21m_in21k-distill-pre_3rdparty_in1k-384px + Metadata: + FLOPs: 13848250176 + Parameters: 21230488 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: TinyViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.21 + Top 5 Accuracy: 97.77 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k-384px_20221021-65be6b3f.pth + Config: configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-384px.py + Converted From: + Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_21m_22kto1k_384_distill.pth + Code: https://github.com/microsoft/Cream/tree/main/TinyViT + - Name: tinyvit-21m_in21k-distill-pre_3rdparty_in1k-512px + Metadata: + FLOPs: 27151420224 + Parameters: 21268120 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: TinyViT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.44 + Top 5 Accuracy: 97.89 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k-512px_20221021-e42a9bea.pth + Config: configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-512px.py + Converted From: + Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_21m_22kto1k_512_distill.pth + Code: https://github.com/microsoft/Cream/tree/main/TinyViT diff --git a/configs/tinyvit/tinyvit-11m-distill_8xb256_in1k.py b/configs/tinyvit/tinyvit-11m-distill_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..145feb9aa65baf4bba947cdebb6e8dad5b9781f5 --- /dev/null +++ b/configs/tinyvit/tinyvit-11m-distill_8xb256_in1k.py @@ -0,0 +1,3 @@ +_base_ = [ + './tinyvit-11m_8xb256_in1k.py', +] diff --git 
a/configs/tinyvit/tinyvit-11m_8xb256_in1k.py b/configs/tinyvit/tinyvit-11m_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f3acfa86a0d5fa24aae44c01064c49f5348d7da3 --- /dev/null +++ b/configs/tinyvit/tinyvit-11m_8xb256_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_pil_bicubic.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', + '../_base_/models/tinyvit/tinyvit-11m.py', +] diff --git a/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-384px.py b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..44e51b1930dd96c987dd4eab9dd77d0e068c801c --- /dev/null +++ b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-384px.py @@ -0,0 +1,29 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_pil_bicubic.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', + '../_base_/models/tinyvit/tinyvit-21m.py', +] + +# model settings +model = dict( + backbone=dict( + img_size=(384, 384), + window_size=[12, 12, 24, 12], + drop_path_rate=0.1, + )) + +# data settings +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(384, 384), + backend='pillow', + interpolation='bicubic'), + dict(type='PackInputs'), +] + +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +test_dataloader = val_dataloader diff --git a/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-512px.py b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-512px.py new file mode 100644 index 0000000000000000000000000000000000000000..05b47c6de94868a6df6ec95cd406095dfc80153e --- /dev/null +++ b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-512px.py @@ -0,0 +1,28 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_pil_bicubic.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', + '../_base_/models/tinyvit/tinyvit-21m.py', +] + +# model settings +model = dict( + backbone=dict( + img_size=(512, 512), + window_size=[16, 16, 32, 16], + drop_path_rate=0.1, + )) +# data settings +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='Resize', + scale=(512, 512), + backend='pillow', + interpolation='bicubic'), + dict(type='PackInputs'), +] + +val_dataloader = dict(batch_size=16, dataset=dict(pipeline=test_pipeline)) + +test_dataloader = val_dataloader diff --git a/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k.py b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..53885852757c6dce993addb6772b7d6e98219d81 --- /dev/null +++ b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k.py @@ -0,0 +1,3 @@ +_base_ = [ + './tinyvit-21m_8xb256_in1k.py', +] diff --git a/configs/tinyvit/tinyvit-21m_8xb256_in1k.py b/configs/tinyvit/tinyvit-21m_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..6c12019c9cf0babe49b24a21fa74fc66d33dda91 --- /dev/null +++ b/configs/tinyvit/tinyvit-21m_8xb256_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_pil_bicubic.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', + '../_base_/models/tinyvit/tinyvit-21m.py', +] diff --git a/configs/tinyvit/tinyvit-5m-distill_8xb256_in1k.py b/configs/tinyvit/tinyvit-5m-distill_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0003c30ac46d2dbe2069733a17b039133b95ae8a --- /dev/null +++ 
b/configs/tinyvit/tinyvit-5m-distill_8xb256_in1k.py @@ -0,0 +1,3 @@ +_base_ = [ + './tinyvit-5m_8xb256_in1k.py', +] diff --git a/configs/tinyvit/tinyvit-5m_8xb256_in1k.py b/configs/tinyvit/tinyvit-5m_8xb256_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..262b5a469c4daa7ed135e466e872bb57e0f1f148 --- /dev/null +++ b/configs/tinyvit/tinyvit-5m_8xb256_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs32_pil_bicubic.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', + '../_base_/models/tinyvit/tinyvit-5m.py', +] diff --git a/configs/tnt/README.md b/configs/tnt/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e86da0b4a8d31a09b6f41e99cff4c233e67a114a --- /dev/null +++ b/configs/tnt/README.md @@ -0,0 +1,77 @@ +# Transformer in Transformer + +> [Transformer in Transformer](https://arxiv.org/abs/2103.00112) + + + +## Abstract + +Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4×4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. + +
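The nested attention described in the abstract can be illustrated with a minimal PyTorch sketch: an inner attention mixes the 4×4 "visual words" inside each 16×16 "visual sentence", the word features are folded back into the sentence embedding, and an outer attention then mixes the sentences. This is only an illustration of the idea, not the mmpretrain TNT backbone; the class name `ToyTNTBlock` and all dimensions (384-d sentences, 16 words of 24-d each) are assumptions chosen for readability.

```python
import torch
import torch.nn as nn


class ToyTNTBlock(nn.Module):
    """Inner attention over 'visual words', outer attention over 'visual sentences'."""

    def __init__(self, sentence_dim=384, word_dim=24, num_words=16):
        super().__init__()
        self.inner_attn = nn.MultiheadAttention(word_dim, num_heads=4, batch_first=True)
        self.word_to_sentence = nn.Linear(num_words * word_dim, sentence_dim)
        self.outer_attn = nn.MultiheadAttention(sentence_dim, num_heads=6, batch_first=True)

    def forward(self, words, sentences):
        # words: (B * num_sentences, num_words, word_dim)
        # sentences: (B, num_sentences, sentence_dim)
        words = words + self.inner_attn(words, words, words)[0]
        b, n, _ = sentences.shape
        # fold the refined word features back into their sentence embedding
        sentences = sentences + self.word_to_sentence(words.reshape(b, n, -1))
        sentences = sentences + self.outer_attn(sentences, sentences, sentences)[0]
        return words, sentences


words = torch.rand(2 * 196, 16, 24)   # 14x14 sentences per image, 4x4 words per sentence
sentences = torch.rand(2, 196, 384)
words, sentences = ToyTNTBlock()(words, sentences)
print(sentences.shape)                # torch.Size([2, 196, 384])
```

Because the inner attention only runs within each small group of words, it adds little cost on top of the usual patch-level (sentence) attention, which is the trade-off the abstract refers to.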
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('tnt-small-p16_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('tnt-small-p16_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/tnt/tnt-s-p16_16xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/tnt/tnt-small-p16_3rdparty_in1k_20210903-c56ee7df.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------: | :------------------------------------------------------------------------------------: | +| `tnt-small-p16_3rdparty_in1k`\* | From scratch | 23.76 | 3.36 | 81.52 | 95.73 | [config](tnt-s-p16_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tnt/tnt-small-p16_3rdparty_in1k_20210903-c56ee7df.pth) | + +*Models with * are converted from the [official repo](https://github.com/contrastive/pytorch-image-models/blob/809271b0f3e5d9be4e11c0c5cec1dbba8b5e2c60/timm/models/tnt.py#L144). The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@misc{han2021transformer, + title={Transformer in Transformer}, + author={Kai Han and An Xiao and Enhua Wu and Jianyuan Guo and Chunjing Xu and Yunhe Wang}, + year={2021}, + eprint={2103.00112}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/configs/tnt/metafile.yml b/configs/tnt/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..dcc2eddb5f479b987767802447cd46fa2a6383bb --- /dev/null +++ b/configs/tnt/metafile.yml @@ -0,0 +1,29 @@ +Collections: + - Name: Transformer in Transformer + Metadata: + Training Data: ImageNet-1k + Paper: + URL: https://arxiv.org/abs/2103.00112 + Title: "Transformer in Transformer" + README: configs/tnt/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/tnt.py#L203 + Version: v0.15.0 + +Models: + - Name: tnt-small-p16_3rdparty_in1k + Metadata: + FLOPs: 3360000000 + Parameters: 23760000 + In Collection: Transformer in Transformer + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.52 + Top 5 Accuracy: 95.73 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/tnt/tnt-small-p16_3rdparty_in1k_20210903-c56ee7df.pth + Config: configs/tnt/tnt-s-p16_16xb64_in1k.py + Converted From: + Weights: https://github.com/contrastive/pytorch-image-models/releases/download/TNT/tnt_s_patch16_224.pth.tar + Code: https://github.com/contrastive/pytorch-image-models/blob/809271b0f3e5d9be4e11c0c5cec1dbba8b5e2c60/timm/models/tnt.py#L144 diff --git a/configs/tnt/tnt-s-p16_16xb64_in1k.py b/configs/tnt/tnt-s-p16_16xb64_in1k.py new file mode 100644 index 
0000000000000000000000000000000000000000..af71232f831089a934d14beb4b187432661921ae --- /dev/null +++ b/configs/tnt/tnt-s-p16_16xb64_in1k.py @@ -0,0 +1,56 @@ +# accuracy_top-1 : 81.52 accuracy_top-5 : 95.73 +_base_ = [ + '../_base_/models/tnt_s_patch16_224.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/default_runtime.py' +] + +# dataset settings +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=248, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(batch_size=64) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-3, weight_decay=0.05)) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict(type='CosineAnnealingLR', T_max=295, by_epoch=True, begin=5, end=300) +] + +train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (16 GPUs) x (64 samples per GPU) +auto_scale_lr = dict(base_batch_size=1024) diff --git a/configs/twins/README.md b/configs/twins/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9e97b7842d9ddb8ab12d13283fb3ed50ed172f70 --- /dev/null +++ b/configs/twins/README.md @@ -0,0 +1,80 @@ +# Twins + +> [Twins: Revisiting the Design of Spatial Attention in Vision Transformers](http://arxiv-export-lb.library.cornell.edu/abs/2104.13840) + + + +## Abstract + +Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is released at [this https URL](https://github.com/Meituan-AutoML/Twins). + +
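One of the attention forms behind Twins, global sub-sampled attention (GSA), keeps global interaction cheap: every position issues queries, while keys and values come from a spatially sub-sampled copy of the feature map, so the computation reduces to matrix multiplications plus a strided convolution. The snippet below is a simplified illustration under assumed sizes (64 channels, a 14×14 token grid, sub-sampling ratio 2); it is not the mmpretrain Twins implementation.

```python
import torch
import torch.nn as nn


class ToyGlobalSubsampledAttention(nn.Module):
    """Queries from every position; keys/values from a spatially sub-sampled map."""

    def __init__(self, dim=64, num_heads=2, sr_ratio=2):
        super().__init__()
        # strided conv shrinks the key/value grid by `sr_ratio` in each direction
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, H*W, C) token sequence over an H x W grid
        b, n, c = x.shape
        kv = self.sr(x.transpose(1, 2).reshape(b, c, h, w))   # (B, C, H/r, W/r)
        kv = kv.flatten(2).transpose(1, 2)                    # (B, H*W/r^2, C)
        return self.attn(x, kv, kv)[0]                        # (B, H*W, C)


x = torch.rand(2, 14 * 14, 64)
print(ToyGlobalSubsampledAttention()(x, 14, 14).shape)  # torch.Size([2, 196, 64])
```

In Twins-SVT this global form alternates with locally-grouped self-attention computed inside non-overlapping windows, which is where the paper's efficiency argument comes from.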
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('twins-pcpvt-small_3rdparty_8xb128_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('twins-pcpvt-small_3rdparty_8xb128_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/twins/twins-pcpvt-small_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-small_3rdparty_8xb128_in1k_20220126-ef23c132.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :----------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :-----------------------------------------------------------------: | +| `twins-pcpvt-small_3rdparty_8xb128_in1k`\* | From scratch | 24.11 | 3.67 | 81.14 | 95.69 | [config](twins-pcpvt-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-small_3rdparty_8xb128_in1k_20220126-ef23c132.pth) | +| `twins-pcpvt-base_3rdparty_8xb128_in1k`\* | From scratch | 43.83 | 6.45 | 82.66 | 96.26 | [config](twins-pcpvt-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-base_3rdparty_8xb128_in1k_20220126-f8c4b0d5.pth) | +| `twins-pcpvt-large_3rdparty_16xb64_in1k`\* | From scratch | 60.99 | 9.51 | 83.09 | 96.59 | [config](twins-pcpvt-large_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-large_3rdparty_16xb64_in1k_20220126-c1ef8d80.pth) | +| `twins-svt-small_3rdparty_8xb128_in1k`\* | From scratch | 24.06 | 2.82 | 81.77 | 95.57 | [config](twins-svt-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-small_3rdparty_8xb128_in1k_20220126-8fe5205b.pth) | +| `twins-svt-base_8xb128_3rdparty_in1k`\* | From scratch | 56.07 | 8.35 | 83.13 | 96.29 | [config](twins-svt-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-base_3rdparty_8xb128_in1k_20220126-e31cc8e9.pth) | +| `twins-svt-large_3rdparty_16xb64_in1k`\* | From scratch | 99.27 | 14.82 | 83.60 | 96.50 | [config](twins-svt-large_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-large_3rdparty_16xb64_in1k_20220126-4817645f.pth) | + +*Models with * are converted from the [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py). The config files of these models are only for inference. 
We haven't reproduced the training results.* + +## Citation + +```bibtex +@article{chu2021twins, + title={Twins: Revisiting spatial attention design in vision transformers}, + author={Chu, Xiangxiang and Tian, Zhi and Wang, Yuqing and Zhang, Bo and Ren, Haibing and Wei, Xiaolin and Xia, Huaxia and Shen, Chunhua}, + journal={arXiv preprint arXiv:2104.13840}, + year={2021} +} +``` diff --git a/configs/twins/metafile.yml b/configs/twins/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..d0d8ff4a324b86865b711b48d769a1f8fdb9130c --- /dev/null +++ b/configs/twins/metafile.yml @@ -0,0 +1,114 @@ +Collections: + - Name: Twins + Metadata: + Training Data: ImageNet-1k + Architecture: + - Global Subsampled Attention + - Locally Grouped Self-Attention + - Conditional Position Encoding + - Pyramid Vision Transformer + Paper: + URL: http://arxiv-export-lb.library.cornell.edu/abs/2104.13840 + Title: "Twins: Revisiting the Design of Spatial Attention in Vision Transformers" + README: configs/twins/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/twins.py + Version: v0.20.1 + +Models: + - Name: twins-pcpvt-small_3rdparty_8xb128_in1k + Metadata: + FLOPs: 3670000000 # 3.67G + Parameters: 24110000 # 24.11M + In Collection: Twins + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.14 + Top 5 Accuracy: 95.69 + Task: Image Classification + Weights: 
https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-small_3rdparty_8xb128_in1k_20220126-8fe5205b.pth + Config: configs/twins/twins-svt-small_8xb128_in1k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py + - Name: twins-svt-base_8xb128_3rdparty_in1k + Metadata: + FLOPs: 8350000000 # 8.35G + Parameters: 56070000 # 56.07M + In Collection: Twins + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.13 + Top 5 Accuracy: 96.29 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-base_3rdparty_8xb128_in1k_20220126-e31cc8e9.pth + Config: configs/twins/twins-svt-base_8xb128_in1k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py + - Name: twins-svt-large_3rdparty_16xb64_in1k + Metadata: + FLOPs: 14820000000 # 14.82G + Parameters: 99270000 # 99.27M + In Collection: Twins + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.60 + Top 5 Accuracy: 96.50 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-large_3rdparty_16xb64_in1k_20220126-4817645f.pth + Config: configs/twins/twins-svt-large_16xb64_in1k.py + Converted From: + Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth + Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py diff --git a/configs/twins/twins-pcpvt-base_8xb128_in1k.py b/configs/twins/twins-pcpvt-base_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..3ac5d2adf15e4c71af8cff09a59acaa9d863f9a7 --- /dev/null +++ b/configs/twins/twins-pcpvt-base_8xb128_in1k.py @@ -0,0 +1,41 @@ +_base_ = [ + '../_base_/models/twins_pcpvt_base.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# dataset settings +train_dataloader = dict(batch_size=128) + +# schedule settings +optim_wrapper = dict( + optimizer=dict( + type='AdamW', + lr=5e-4 * 128 * 8 / 512, # learning rate for 128 batch size, 8 gpu. 
+ weight_decay=0.05, + eps=1e-8, + betas=(0.9, 0.999)), + paramwise_cfg=dict(_delete=True, norm_decay_mult=0.0, bias_decay_mult=0.0), + clip_grad=dict(max_norm=5.0), +) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=295, + eta_min=1e-5, + by_epoch=True, + begin=5, + end=300) +] diff --git a/configs/twins/twins-pcpvt-large_16xb64_in1k.py b/configs/twins/twins-pcpvt-large_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0acfd7528b5c17ece73586df3ce7dc850ea5a64a --- /dev/null +++ b/configs/twins/twins-pcpvt-large_16xb64_in1k.py @@ -0,0 +1,7 @@ +_base_ = ['twins-pcpvt-base_8xb128_in1k.py'] + +# model settings +model = dict(backbone=dict(arch='large'), head=dict(in_channels=512)) + +# dataset settings +train_dataloader = dict(batch_size=64) diff --git a/configs/twins/twins-pcpvt-small_8xb128_in1k.py b/configs/twins/twins-pcpvt-small_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9fe763b77754bf249030d48459302e532900a1a3 --- /dev/null +++ b/configs/twins/twins-pcpvt-small_8xb128_in1k.py @@ -0,0 +1,4 @@ +_base_ = ['twins-pcpvt-base_8xb128_in1k.py'] + +# model settings +model = dict(backbone=dict(arch='small'), head=dict(in_channels=512)) diff --git a/configs/twins/twins-svt-base_8xb128_in1k.py b/configs/twins/twins-svt-base_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..1d24f63b074afe59574d04e40f8379ec6c386baa --- /dev/null +++ b/configs/twins/twins-svt-base_8xb128_in1k.py @@ -0,0 +1,41 @@ +_base_ = [ + '../_base_/models/twins_svt_base.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# dataset settings +train_dataloader = dict(batch_size=128) + +# schedule settings +optim_wrapper = dict( + optimizer=dict( + type='AdamW', + lr=5e-4 * 128 * 8 / 512, # learning rate for 128 batch size, 8 gpu. 
+ weight_decay=0.05, + eps=1e-8, + betas=(0.9, 0.999)), + paramwise_cfg=dict(_delete=True, norm_decay_mult=0.0, bias_decay_mult=0.0), + clip_grad=dict(max_norm=5.0), +) + +param_scheduler = [ + # warm up learning rate scheduler + dict( + type='LinearLR', + start_factor=1e-3, + by_epoch=True, + begin=0, + end=5, + # update by iter + convert_to_iter_based=True), + # main learning rate scheduler + dict( + type='CosineAnnealingLR', + T_max=295, + eta_min=1e-5, + by_epoch=True, + begin=5, + end=300) +] diff --git a/configs/twins/twins-svt-large_16xb64_in1k.py b/configs/twins/twins-svt-large_16xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e8a1eba894e5f831376ad7c5871434db438db59b --- /dev/null +++ b/configs/twins/twins-svt-large_16xb64_in1k.py @@ -0,0 +1,7 @@ +_base_ = ['twins-svt-base_8xb128_in1k.py'] + +# model settings +model = dict(backbone=dict(arch='large'), head=dict(in_channels=1024)) + +# dataset settings +train_dataloader = dict(batch_size=64) diff --git a/configs/twins/twins-svt-small_8xb128_in1k.py b/configs/twins/twins-svt-small_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2ffe267b56e921abcdcc40c833bba42e9952a4d4 --- /dev/null +++ b/configs/twins/twins-svt-small_8xb128_in1k.py @@ -0,0 +1,4 @@ +_base_ = ['twins-svt-base_8xb128_in1k.py'] + +# model settings +model = dict(backbone=dict(arch='small'), head=dict(in_channels=512)) diff --git a/configs/van/README.md b/configs/van/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7e548b6b8003169602ea6a205c2c305b8808ed39 --- /dev/null +++ b/configs/van/README.md @@ -0,0 +1,78 @@ +# Visual-Attention-Network + +> [Visual Attention Network](https://arxiv.org/abs/2202.09741) + + + +## Abstract + +While originally designed for natural language processing (NLP) tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention while avoiding the above issues. We further introduce a novel neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple and efficient, VAN outperforms the state-of-the-art vision transformers and convolutional neural networks with a large margin in extensive experiments, including image classification, object detection, semantic segmentation, instance segmentation, etc. + +
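The large kernel attention (LKA) module named in the abstract decomposes a large convolution into a local depth-wise convolution, a depth-wise dilated convolution for long-range context, and a 1×1 convolution for channel mixing, then multiplies the result with the input as an attention map. Below is a minimal sketch of that decomposition using the commonly cited 5×5 and dilated 7×7 kernels; the channel count and feature-map size are illustrative assumptions, and the real module lives in the VAN backbone referenced by these configs.

```python
import torch
import torch.nn as nn


class ToyLKA(nn.Module):
    """Decomposed large-kernel attention: local DW conv, dilated DW conv, 1x1 conv."""

    def __init__(self, dim=64):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)                       # local context
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, groups=dim, dilation=3)   # long range
        self.pw = nn.Conv2d(dim, dim, 1)                                              # channel mixing

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn  # spatially and channel-wise adaptive gating


x = torch.rand(2, 64, 56, 56)
print(ToyLKA()(x).shape)  # torch.Size([2, 64, 56, 56])
```

The element-wise product is what gives the module both the spatial and the channel adaptability that the abstract contrasts with plain self-attention.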
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('van-tiny_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('van-tiny_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/van/van-tiny_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/van/van-tiny_8xb128_in1k_20220501-385941af.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :-------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------: | :----------------------------------------------------------------------------------------: | +| `van-tiny_3rdparty_in1k`\* | From scratch | 4.11 | 0.88 | 75.41 | 93.02 | [config](van-tiny_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/van/van-tiny_8xb128_in1k_20220501-385941af.pth) | +| `van-small_3rdparty_in1k`\* | From scratch | 13.86 | 2.52 | 81.01 | 95.63 | [config](van-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/van/van-small_8xb128_in1k_20220501-17bc91aa.pth) | +| `van-base_3rdparty_in1k`\* | From scratch | 26.58 | 5.03 | 82.80 | 96.21 | [config](van-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/van/van-base_8xb128_in1k_20220501-6a4cc31b.pth) | +| `van-large_3rdparty_in1k`\* | From scratch | 44.77 | 8.99 | 83.86 | 96.73 | [config](van-large_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/van/van-large_8xb128_in1k_20220501-f212ba21.pth) | + +*Models with * are converted from the [official repo](https://github.com/Visual-Attention-Network/VAN-Classification). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{guo2022visual, + title={Visual Attention Network}, + author={Guo, Meng-Hao and Lu, Cheng-Ze and Liu, Zheng-Ning and Cheng, Ming-Ming and Hu, Shi-Min}, + journal={arXiv preprint arXiv:2202.09741}, + year={2022} +} +``` diff --git a/configs/van/metafile.yml b/configs/van/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..db5a6e6443c13a1eb9dc669923d8c0902e89ee7a --- /dev/null +++ b/configs/van/metafile.yml @@ -0,0 +1,82 @@ +Collections: + - Name: Visual-Attention-Network + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - AdamW + - Weight Decay + Architecture: + - Visual Attention Network + Paper: + URL: https://arxiv.org/abs/2202.09741 + Title: "Visual Attention Network" + README: configs/van/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.23.0/mmcls/models/backbones/van.py + Version: v0.23.0 + +Models: + - Name: van-tiny_3rdparty_in1k + Metadata: + Parameters: 4110000 # 4.11M + FLOPs: 880000000 # 0.88G + In Collection: Visual-Attention-Network + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 75.41 + Top 5 Accuracy: 93.02 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/van/van-tiny_8xb128_in1k_20220501-385941af.pth + Config: configs/van/van-tiny_8xb128_in1k.py + Converted From: + Code: https://github.com/Visual-Attention-Network/VAN-Classification + Weights: https://cloud.tsinghua.edu.cn/f/aada2242a16245d6a561/?dl=1 + - Name: van-small_3rdparty_in1k + Metadata: + Parameters: 13860000 # 13.86M + FLOPs: 2520000000 # 2.52G + In Collection: Visual-Attention-Network + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.01 + Top 5 Accuracy: 95.63 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/van/van-small_8xb128_in1k_20220501-17bc91aa.pth + Config: configs/van/van-small_8xb128_in1k.py + Converted From: + Code: https://github.com/Visual-Attention-Network/VAN-Classification + Weights: https://cloud.tsinghua.edu.cn/f/dd3eb73692f74a2499c9/?dl=1 + - Name: van-base_3rdparty_in1k + Metadata: + Parameters: 26580000 # 26.58M + FLOPs: 5030000000 # 5.03G + In Collection: Visual-Attention-Network + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.80 + Top 5 Accuracy: 96.21 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/van/van-base_8xb128_in1k_20220501-6a4cc31b.pth + Config: configs/van/van-base_8xb128_in1k.py + Converted From: + Code: https://github.com/Visual-Attention-Network/VAN-Classification + Weights: https://cloud.tsinghua.edu.cn/f/58e7acceaf334ecdba89/?dl=1 + - Name: van-large_3rdparty_in1k + Metadata: + Parameters: 44770000 # 44.77 M + FLOPs: 8990000000 # 8.99G + In Collection: Visual-Attention-Network + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.86 + Top 5 Accuracy: 96.73 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/van/van-large_8xb128_in1k_20220501-f212ba21.pth + Config: configs/van/van-large_8xb128_in1k.py + Converted From: + Code: https://github.com/Visual-Attention-Network/VAN-Classification + Weights: https://cloud.tsinghua.edu.cn/f/0201745f6920482490a0/?dl=1 diff --git a/configs/van/van-base_8xb128_in1k.py b/configs/van/van-base_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..47082b748554eea9dfc467f63a5644294131fd14 --- /dev/null +++ 
b/configs/van/van-base_8xb128_in1k.py @@ -0,0 +1,65 @@ +_base_ = [ + '../_base_/models/van/van_base.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +# dataset setting +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=248, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/van/van-large_8xb128_in1k.py b/configs/van/van-large_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b16567726222306eff4a28ef76361922ecf28970 --- /dev/null +++ b/configs/van/van-large_8xb128_in1k.py @@ -0,0 +1,65 @@ +_base_ = [ + '../_base_/models/van/van_large.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# dataset setting +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=248, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# 
schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/van/van-small_8xb128_in1k.py b/configs/van/van-small_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..bbbbbdf4c8b7441a19c00c44f012478b1021335a --- /dev/null +++ b/configs/van/van-small_8xb128_in1k.py @@ -0,0 +1,65 @@ +_base_ = [ + '../_base_/models/van/van_small.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# dataset setting +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=248, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/van/van-tiny_8xb128_in1k.py b/configs/van/van-tiny_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..2ac62dab083c5c42dfd532f9191f01c74fcc9408 --- /dev/null +++ b/configs/van/van-tiny_8xb128_in1k.py @@ -0,0 +1,65 @@ +_base_ = [ + '../_base_/models/van/van_tiny.py', + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# dataset setting +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +bgr_mean = data_preprocessor['mean'][::-1] +bgr_std = data_preprocessor['std'][::-1] + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='RandomResizedCrop', + scale=224, + backend='pillow', + interpolation='bicubic'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict( + type='RandAugment', + policies='timm_increasing', + num_policies=2, + total_level=10, + magnitude_level=9, + magnitude_std=0.5, + hparams=dict( + pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')), + dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4), + dict( + type='RandomErasing', + erase_prob=0.25, + mode='rand', + min_area_ratio=0.02, + max_area_ratio=1 / 3, + fill_color=bgr_mean, + fill_std=bgr_std), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=248, + edge='short', + backend='pillow', + 
interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule settings +optim_wrapper = dict(clip_grad=dict(max_norm=5.0)) diff --git a/configs/vgg/README.md b/configs/vgg/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7af69ce6b87d1ce989881fa17bf5c6cacc3748be --- /dev/null +++ b/configs/vgg/README.md @@ -0,0 +1,86 @@ +# VGG + +> [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/abs/1409.1556) + + + +## Abstract + +In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision. + +
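The abstract's central idea, increasing depth while keeping every convolution at 3x3, is easy to sketch. The helper below stacks VGG-16-style stages (2, 2, 3, 3, 3 convolutions, each followed by 2x2 max-pooling); it is a toy illustration rather than the VGG backbone used by the configs in this folder, and the layer counts and channel widths simply follow the well-known VGG-16 layout.

```python
import torch
import torch.nn as nn


def vgg_stage(in_ch, out_ch, num_convs):
    """A VGG-style stage: repeated 3x3 conv + ReLU, then 2x2 max-pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)


# VGG-16 convolutional body: 13 conv layers in five stages (the remaining 3 of the 16 are FC layers)
features = nn.Sequential(
    vgg_stage(3, 64, 2),
    vgg_stage(64, 128, 2),
    vgg_stage(128, 256, 3),
    vgg_stage(256, 512, 3),
    vgg_stage(512, 512, 3),
)
x = torch.rand(1, 3, 224, 224)
print(features(x).shape)  # torch.Size([1, 512, 7, 7])
```

Two stacked 3x3 convolutions cover the same receptive field as a single 5x5 one with fewer parameters and an extra non-linearity, which is why depth can be pushed to 16-19 weight layers at reasonable cost.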
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('vgg11_8xb32_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('vgg11_8xb32_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/vgg/vgg11_8xb32_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/vgg/vgg11_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------: | :--------------------------------------------------------------------------------------------------: | +| `vgg11_8xb32_in1k` | From scratch | 132.86 | 7.63 | 68.75 | 88.87 | [config](vgg11_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.json) | +| `vgg13_8xb32_in1k` | From scratch | 133.05 | 11.34 | 70.02 | 89.46 | [config](vgg13_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_batch256_imagenet_20210208-4d1d6080.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_batch256_imagenet_20210208-4d1d6080.json) | +| `vgg16_8xb32_in1k` | From scratch | 138.36 | 15.50 | 71.62 | 90.49 | [config](vgg16_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_batch256_imagenet_20210208-db26f1a5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_batch256_imagenet_20210208-db26f1a5.json) | +| `vgg19_8xb32_in1k` | From scratch | 143.67 | 19.67 | 72.41 | 90.80 | [config](vgg19_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_batch256_imagenet_20210208-e6920e4a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_batch256_imagenet_20210208-e6920e4a.json) | +| `vgg11bn_8xb32_in1k` | From scratch | 132.87 | 7.64 | 70.67 | 90.16 | [config](vgg11bn_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_bn_batch256_imagenet_20210207-f244902c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_bn_batch256_imagenet_20210207-f244902c.json) | +| `vgg13bn_8xb32_in1k` | From scratch | 133.05 | 11.36 | 72.12 | 90.66 | [config](vgg13bn_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_bn_batch256_imagenet_20210207-1a8b7864.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_bn_batch256_imagenet_20210207-1a8b7864.json) | +| `vgg16bn_8xb32_in1k` | From scratch | 138.37 | 15.53 | 73.74 | 91.66 | [config](vgg16bn_8xb32_in1k.py) | 
[model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_bn_batch256_imagenet_20210208-7e55cd29.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_bn_batch256_imagenet_20210208-7e55cd29.json) | +| `vgg19bn_8xb32_in1k` | From scratch | 143.68 | 19.70 | 74.68 | 92.27 | [config](vgg19bn_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_bn_batch256_imagenet_20210208-da620c4f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_bn_batch256_imagenet_20210208-da620c4f.json) | + +## Citation + +```bibtex +@article{simonyan2014very, + title={Very deep convolutional networks for large-scale image recognition}, + author={Simonyan, Karen and Zisserman, Andrew}, + journal={arXiv preprint arXiv:1409.1556}, + year={2014} +} +``` diff --git a/configs/vgg/metafile.yml b/configs/vgg/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..ce3af191a746878f7d9b6febf67cc6c96a5fa8c1 --- /dev/null +++ b/configs/vgg/metafile.yml @@ -0,0 +1,125 @@ +Collections: + - Name: VGG + Metadata: + Training Data: ImageNet-1k + Training Techniques: + - SGD with Momentum + - Weight Decay + Training Resources: 8x Xp GPUs + Epochs: 100 + Batch Size: 256 + Architecture: + - VGG + Paper: + URL: https://arxiv.org/abs/1409.1556 + Title: "Very Deep Convolutional Networks for Large-Scale Image Recognition" + README: configs/vgg/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/vgg.py#L39 + Version: v0.15.0 + +Models: + - Name: vgg11_8xb32_in1k + Metadata: + FLOPs: 7630000000 + Parameters: 132860000 + In Collection: VGG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 68.75 + Top 5 Accuracy: 88.87 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.pth + Config: configs/vgg/vgg11_8xb32_in1k.py + - Name: vgg13_8xb32_in1k + Metadata: + FLOPs: 11340000000 + Parameters: 133050000 + In Collection: VGG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 70.02 + Top 5 Accuracy: 89.46 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_batch256_imagenet_20210208-4d1d6080.pth + Config: configs/vgg/vgg13_8xb32_in1k.py + - Name: vgg16_8xb32_in1k + Metadata: + FLOPs: 15500000000 + Parameters: 138360000 + In Collection: VGG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 71.62 + Top 5 Accuracy: 90.49 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_batch256_imagenet_20210208-db26f1a5.pth + Config: configs/vgg/vgg16_8xb32_in1k.py + - Name: vgg19_8xb32_in1k + Metadata: + FLOPs: 19670000000 + Parameters: 143670000 + In Collection: VGG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 72.41 + Top 5 Accuracy: 90.8 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_batch256_imagenet_20210208-e6920e4a.pth + Config: configs/vgg/vgg19_8xb32_in1k.py + - Name: vgg11bn_8xb32_in1k + Metadata: + FLOPs: 7640000000 + Parameters: 132870000 + In Collection: VGG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 70.67 + Top 5 Accuracy: 90.16 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_bn_batch256_imagenet_20210207-f244902c.pth + Config: configs/vgg/vgg11bn_8xb32_in1k.py + - Name: vgg13bn_8xb32_in1k + Metadata: + FLOPs: 11360000000 + 
Parameters: 133050000 + In Collection: VGG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 72.12 + Top 5 Accuracy: 90.66 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_bn_batch256_imagenet_20210207-1a8b7864.pth + Config: configs/vgg/vgg13bn_8xb32_in1k.py + - Name: vgg16bn_8xb32_in1k + Metadata: + FLOPs: 15530000000 + Parameters: 138370000 + In Collection: VGG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 73.74 + Top 5 Accuracy: 91.66 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_bn_batch256_imagenet_20210208-7e55cd29.pth + Config: configs/vgg/vgg16bn_8xb32_in1k.py + - Name: vgg19bn_8xb32_in1k + Metadata: + FLOPs: 19700000000 + Parameters: 143680000 + In Collection: VGG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 74.68 + Top 5 Accuracy: 92.27 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_bn_batch256_imagenet_20210208-da620c4f.pth + Config: configs/vgg/vgg19bn_8xb32_in1k.py diff --git a/configs/vgg/vgg11_8xb32_in1k.py b/configs/vgg/vgg11_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..616233c418fdeaa5d08db75b290f3438ec96b13c --- /dev/null +++ b/configs/vgg/vgg11_8xb32_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/vgg11.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict(optimizer=dict(lr=0.01)) diff --git a/configs/vgg/vgg11bn_8xb32_in1k.py b/configs/vgg/vgg11bn_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..22f55ef0851ee4728caad271cfdaf02fb5c4afed --- /dev/null +++ b/configs/vgg/vgg11bn_8xb32_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/vgg11bn.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] diff --git a/configs/vgg/vgg13_8xb32_in1k.py b/configs/vgg/vgg13_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..ec1c98fb997568754868670a0f9d37233e6ca57d --- /dev/null +++ b/configs/vgg/vgg13_8xb32_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/vgg13.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict(optimizer=dict(lr=0.01)) diff --git a/configs/vgg/vgg13bn_8xb32_in1k.py b/configs/vgg/vgg13bn_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..3cb3592b09e06e1b902c6d1fcca2cb03bcb7f82c --- /dev/null +++ b/configs/vgg/vgg13bn_8xb32_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/vgg13bn.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] diff --git a/configs/vgg/vgg16_8xb16_voc.py b/configs/vgg/vgg16_8xb16_voc.py new file mode 100644 index 0000000000000000000000000000000000000000..5d9e347bf533f36eb165dd06d0faf20ccbaba917 --- /dev/null +++ b/configs/vgg/vgg16_8xb16_voc.py @@ -0,0 +1,43 @@ +_base_ = [ + '../_base_/datasets/voc_bs16.py', + '../_base_/default_runtime.py', +] + +# model settings + +# load model pretrained on imagenet +pretrained = 'https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_batch256_imagenet_20210208-db26f1a5.pth' # noqa + +# use different head for multilabel 
task +model = dict( + type='ImageClassifier', + backbone=dict( + type='VGG', + depth=16, + num_classes=20, + init_cfg=dict( + type='Pretrained', checkpoint=pretrained, prefix='backbone')), + neck=None, + head=dict( + type='MultiLabelClsHead', + loss=dict(type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0))) + +# schedule settings +optim_wrapper = dict( + optimizer=dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0), + # update the final linear by 10 times learning rate. + paramwise_cfg=dict(custom_keys={'.backbone.classifier': dict(lr_mult=10)}), +) + +# learning policy +param_scheduler = dict(type='StepLR', by_epoch=True, step_size=20, gamma=0.1) + +# train, val, test setting +train_cfg = dict(by_epoch=True, max_epochs=40, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. +# base_batch_size = (8 GPUs) x (16 samples per GPU) +auto_scale_lr = dict(base_batch_size=128) diff --git a/configs/vgg/vgg16_8xb32_in1k.py b/configs/vgg/vgg16_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a291da2813f011323f7ba19724dc92d87b935f80 --- /dev/null +++ b/configs/vgg/vgg16_8xb32_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/vgg16.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict(optimizer=dict(lr=0.01)) diff --git a/configs/vgg/vgg16bn_8xb32_in1k.py b/configs/vgg/vgg16bn_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..f6bbb81b86b279bbf84d7b877ef3bc370dedbf4e --- /dev/null +++ b/configs/vgg/vgg16bn_8xb32_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/vgg16bn.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] diff --git a/configs/vgg/vgg19_8xb32_in1k.py b/configs/vgg/vgg19_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..88cd24c1dd9cb28dc3c91e4403b241c441dfbe03 --- /dev/null +++ b/configs/vgg/vgg19_8xb32_in1k.py @@ -0,0 +1,9 @@ +_base_ = [ + '../_base_/models/vgg19.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# schedule settings +optim_wrapper = dict(optimizer=dict(lr=0.01)) diff --git a/configs/vgg/vgg19bn_8xb32_in1k.py b/configs/vgg/vgg19bn_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..4b4f34aba0ad5f665b86a8173af9e4436546af23 --- /dev/null +++ b/configs/vgg/vgg19bn_8xb32_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/vgg19bn.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] diff --git a/configs/vig/README.md b/configs/vig/README.md new file mode 100644 index 0000000000000000000000000000000000000000..624e387ac3799f599cbd886e9053cfa1d2a2de95 --- /dev/null +++ b/configs/vig/README.md @@ -0,0 +1,81 @@ +# VIG + +> [Vision GNN: An Image is Worth Graph of Nodes](https://arxiv.org/abs/2206.00272) + + + +## Abstract + +Network architecture plays a key role in the deep learning-based computer vision system. The widely-used convolutional neural network and transformer treat the image as a grid or sequence structure, which is not flexible to capture irregular and complex objects. 
In this paper, we propose to represent the image as a graph structure and introduce a new Vision GNN (ViG) architecture to extract graph-level feature for visual tasks. We first split the image to a number of patches which are viewed as nodes, and construct a graph by connecting the nearest neighbors. Based on the graph representation of images, we build our ViG model to transform and exchange information among all the nodes. ViG consists of two basic modules: Grapher module with graph convolution for aggregating and updating graph information, and FFN module with two linear layers for node feature transformation. Both isotropic and pyramid architectures of ViG are built with different model sizes. Extensive experiments on image recognition and object detection tasks demonstrate the superiority of our ViG architecture. We hope this pioneering study of GNN on general visual tasks will provide useful inspiration and experience for future research. + +
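The Grapher module described above hinges on treating patches as graph nodes and aggregating each node with its k nearest neighbours in feature space. The snippet below shows one hedged interpretation of that step, a max-relative style aggregation over a kNN graph built from plain pairwise distances; the node count, feature width, and `k` are illustrative assumptions, and the actual ViG implementation behind these configs differs in detail (graph construction, normalisation, the FFN module, and so on).

```python
import torch
import torch.nn as nn


class ToyMaxRelativeGraphConv(nn.Module):
    """Aggregate each node with the max of (neighbour - node) over its k nearest neighbours."""

    def __init__(self, dim=64, k=9):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(2 * dim, dim)

    def forward(self, x):
        # x: (B, N, C) patch/node features
        dist = torch.cdist(x, x)                          # (B, N, N) pairwise distances
        idx = dist.topk(self.k, largest=False).indices    # (B, N, k) nearest neighbours
        neighbours = torch.gather(
            x.unsqueeze(1).expand(-1, x.size(1), -1, -1),                 # (B, N, N, C)
            2, idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1)))          # (B, N, k, C)
        relative = (neighbours - x.unsqueeze(2)).max(dim=2).values        # (B, N, C)
        return self.fc(torch.cat([x, relative], dim=-1))


x = torch.rand(2, 196, 64)  # 14x14 patches as graph nodes
print(ToyMaxRelativeGraphConv()(x).shape)  # torch.Size([2, 196, 64])
```

Because the neighbourhood is recomputed from the node features themselves, connectivity is not tied to the 2D grid, which is the flexibility the abstract claims over convolutional and sequence-based backbones.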
+ +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('vig-tiny_3rdparty_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('vig-tiny_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/vig/vig-tiny_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/vig/vig-tiny_3rdparty_in1k_20230117-6414c684.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :------------------------------------------------------------------------------------: | +| `vig-tiny_3rdparty_in1k`\* | From scratch | 7.18 | 1.31 | 74.40 | 92.34 | [config](vig-tiny_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/vig-tiny_3rdparty_in1k_20230117-6414c684.pth) | +| `vig-small_3rdparty_in1k`\* | From scratch | 22.75 | 4.54 | 80.61 | 95.28 | [config](vig-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/vig-small_3rdparty_in1k_20230117-5338bf3b.pth) | +| `vig-base_3rdparty_in1k`\* | From scratch | 20.68 | 17.68 | 82.62 | 96.04 | [config](vig-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/vig-base_3rdparty_in1k_20230117-92f6f12f.pth) | +| `pvig-tiny_3rdparty_in1k`\* | From scratch | 9.46 | 1.71 | 78.38 | 94.38 | [config](pvig-tiny_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/pvig-tiny_3rdparty_in1k_20230117-eb77347d.pth) | +| `pvig-small_3rdparty_in1k`\* | From scratch | 29.02 | 4.57 | 82.00 | 95.97 | [config](pvig-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/pvig-small_3rdparty_in1k_20230117-9433dc96.pth) | +| `pvig-medium_3rdparty_in1k`\* | From scratch | 51.68 | 8.89 | 83.12 | 96.35 | [config](pvig-medium_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/pvig-medium_3rdparty_in1k_20230117-21057a6d.pth) | +| `pvig-base_3rdparty_in1k`\* | From scratch | 95.21 | 16.86 | 83.59 | 96.52 | [config](pvig-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/pvig-base_3rdparty_in1k_20230117-dbab3c85.pth) | + +*Models with * are converted from the [official repo](https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@inproceedings{han2022vig, + title={Vision GNN: An Image is Worth Graph of Nodes}, + author={Kai Han and Yunhe Wang and Jianyuan Guo and Yehui Tang and Enhua Wu}, + booktitle={NeurIPS}, + year={2022} +} +``` diff --git a/configs/vig/metafile.yml b/configs/vig/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..52bd18baf1623bf1f12a95d93c331749847a1339 --- /dev/null +++ b/configs/vig/metafile.yml @@ -0,0 +1,134 @@ +Collections: + - Name: VIG + Metadata: + Training Data: ImageNet-1k + Architecture: + - Vision GNN + Paper: + Title: 'Vision GNN: An Image is Worth Graph of Nodes' + URL: https://arxiv.org/abs/2206.00272 + README: configs/vig/README.md + Code: + URL: null + Version: null + +Models: + - Name: vig-tiny_3rdparty_in1k + Metadata: + FLOPs: 1309000000 + Parameters: 7185000 + Training Data: ImageNet-1k + In Collection: VIG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 74.40 + Top 5 Accuracy: 92.34 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vig/vig-tiny_3rdparty_in1k_20230117-6414c684.pth + Config: configs/vig/vig-tiny_8xb128_in1k.py + Converted From: + Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/vig/vig_ti_74.5.pth + Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch + - Name: vig-small_3rdparty_in1k + Metadata: + FLOPs: 4535000000 + Parameters: 22748000 + Training Data: ImageNet-1k + In Collection: VIG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.61 + Top 5 Accuracy: 95.28 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vig/vig-small_3rdparty_in1k_20230117-5338bf3b.pth + Config: configs/vig/vig-small_8xb128_in1k.py + Converted From: + Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/vig/vig_s_80.6.pth + Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch + - Name: vig-base_3rdparty_in1k + Metadata: + FLOPs: 17681000000 + Parameters: 20685000 + Training Data: ImageNet-1k + In Collection: VIG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.62 + Top 5 Accuracy: 96.04 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vig/vig-base_3rdparty_in1k_20230117-92f6f12f.pth + Config: configs/vig/vig-base_8xb128_in1k.py + Converted From: + Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/vig/vig_b_82.6.pth + Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch + - Name: pvig-tiny_3rdparty_in1k + Metadata: + FLOPs: 1714000000 + Parameters: 9458000 + Training Data: ImageNet-1k + In Collection: VIG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.38 + Top 5 Accuracy: 94.38 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vig/pvig-tiny_3rdparty_in1k_20230117-eb77347d.pth + Config: configs/vig/pvig-tiny_8xb128_in1k.py + Converted From: + Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/pyramid-vig/pvig_ti_78.5.pth.tar + Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch + - Name: pvig-small_3rdparty_in1k + Metadata: + FLOPs: 4572000000 + Parameters: 29024000 + Training Data: ImageNet-1k + In Collection: VIG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.00 + Top 5 Accuracy: 
95.97 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vig/pvig-small_3rdparty_in1k_20230117-9433dc96.pth + Config: configs/vig/pvig-small_8xb128_in1k.py + Converted From: + Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/pyramid-vig/pvig_s_82.1.pth.tar + Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch + - Name: pvig-medium_3rdparty_in1k + Metadata: + FLOPs: 8886000000 + Parameters: 51682000 + Training Data: ImageNet-1k + In Collection: VIG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.12 + Top 5 Accuracy: 96.35 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vig/pvig-medium_3rdparty_in1k_20230117-21057a6d.pth + Config: configs/vig/pvig-medium_8xb128_in1k.py + Converted From: + Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/pyramid-vig/pvig_m_83.1.pth.tar + Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch + - Name: pvig-base_3rdparty_in1k + Metadata: + FLOPs: 16861000000 + Parameters: 95213000 + Training Data: ImageNet-1k + In Collection: VIG + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.59 + Top 5 Accuracy: 96.52 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/vig/pvig-base_3rdparty_in1k_20230117-dbab3c85.pth + Config: configs/vig/pvig-base_8xb128_in1k.py + Converted From: + Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/pyramid-vig/pvig_b_83.66.pth.tar + Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch diff --git a/configs/vig/pvig-base_8xb128_in1k.py b/configs/vig/pvig-base_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..1d66359c6c78068e48e0466fede86f11e14e9a91 --- /dev/null +++ b/configs/vig/pvig-base_8xb128_in1k.py @@ -0,0 +1,22 @@ +_base_ = [ + '../_base_/models/vig/pyramid_vig_base.py', + '../_base_/datasets/imagenet_bs128_vig_224.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] + +# dataset settings +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict( + type='ResizeEdge', + scale=235, + edge='short', + backend='pillow', + interpolation='bicubic'), + dict(type='CenterCrop', crop_size=224), + dict(type='PackInputs'), +] + +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) diff --git a/configs/vig/pvig-medium_8xb128_in1k.py b/configs/vig/pvig-medium_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..75c25a2d89b0b8fce8d816d0129afeaf63d6a5e2 --- /dev/null +++ b/configs/vig/pvig-medium_8xb128_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/vig/pyramid_vig_medium.py', + '../_base_/datasets/imagenet_bs128_vig_224.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] diff --git a/configs/vig/pvig-small_8xb128_in1k.py b/configs/vig/pvig-small_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..755b3319d313f02ce9f1c2f2a943ddd934f7e49b --- /dev/null +++ b/configs/vig/pvig-small_8xb128_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/vig/pyramid_vig_small.py', + '../_base_/datasets/imagenet_bs128_vig_224.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] diff --git a/configs/vig/pvig-tiny_8xb128_in1k.py 
b/configs/vig/pvig-tiny_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..7a885559c597962201bed20249f8b688589a7788 --- /dev/null +++ b/configs/vig/pvig-tiny_8xb128_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/vig/pyramid_vig_tiny.py', + '../_base_/datasets/imagenet_bs128_vig_224.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] diff --git a/configs/vig/vig-base_8xb128_in1k.py b/configs/vig/vig-base_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..cb8b55e3e841659f65e975947a9859361e34aa28 --- /dev/null +++ b/configs/vig/vig-base_8xb128_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/vig/vig_base.py', + '../_base_/datasets/imagenet_bs128_vig_224.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] diff --git a/configs/vig/vig-small_8xb128_in1k.py b/configs/vig/vig-small_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..41508b2894d0849cfc92dd2340c71bebdf06f591 --- /dev/null +++ b/configs/vig/vig-small_8xb128_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/vig/vig_small.py', + '../_base_/datasets/imagenet_bs128_vig_224.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] diff --git a/configs/vig/vig-tiny_8xb128_in1k.py b/configs/vig/vig-tiny_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..80b1693ad5baecd57d450ae33806e80ddce0f55e --- /dev/null +++ b/configs/vig/vig-tiny_8xb128_in1k.py @@ -0,0 +1,6 @@ +_base_ = [ + '../_base_/models/vig/vig_tiny.py', + '../_base_/datasets/imagenet_bs128_vig_224.py', + '../_base_/schedules/imagenet_bs256.py', + '../_base_/default_runtime.py', +] diff --git a/configs/vision_transformer/README.md b/configs/vision_transformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..66bd3f529dd85062323c585b38660ab414362250 --- /dev/null +++ b/configs/vision_transformer/README.md @@ -0,0 +1,101 @@ +# Vision Transformer + +> [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) + + + +## Introduction + +**Vision Transformer**, known as **ViT**, succeeded in using a full transformer to outperform previous works that based on convolutional networks in vision field. ViT splits image into patches to feed the multi-head attentions, concatenates a learnable class token for final prediction and adds a learnable position embeddings for relative positional message between patches. Based on these three techniques with attentions, ViT provides a brand-new pattern to build a basic structure in vision field. + +The strategy works even better when coupled with large datasets pre-trainings. Because of its simplicity and effectiveness, some after works in classification field are originated from ViT. And even in recent multi-modality field, ViT-based method still plays a role in it. + +
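The patch-token recipe described above (split the image into patches, prepend a learnable class token, add learnable position embeddings) can be sketched in a few lines of PyTorch. This is only an illustrative toy, not the `VisionTransformer` backbone used by the configs below; the class name and the default sizes (224px images, 16px patches, 768 dimensions, matching ViT-Base/16) are assumptions made for the example.

```python
import torch
import torch.nn as nn


class ToyViTEmbedding(nn.Module):
    """Toy illustration: image -> patch tokens + class token + position embeddings."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution splits the image into non-overlapping patches
        # and projects each patch to embed_dim in one step.
        self.proj = nn.Conv2d(
            in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable class token, prepended to the patch sequence for prediction.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable position embeddings, one per patch plus one for the class token.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):
        B = x.shape[0]
        patches = self.proj(x).flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(B, -1, -1)              # (B, 1, embed_dim)
        tokens = torch.cat([cls, patches], dim=1)           # (B, num_patches + 1, embed_dim)
        return tokens + self.pos_embed                      # fed to the transformer encoder


tokens = ToyViTEmbedding()(torch.rand(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768])
```

The resulting sequence of 197 tokens is what the transformer encoder layers operate on; only the class token is used by the classification head.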
+ +## Abstract + +
+ +While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. +
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('vit-base-p32_in21k-pre_3rdparty_in1k-384px', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('vit-base-p32_in21k-pre_3rdparty_in1k-384px', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Train/Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Train: + +```shell +python tools/train.py configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py +``` + +Test: + +```shell +python tools/test.py configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-9cea8599.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :---------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------: | :----------------------------------------------------------: | +| `vit-base-p32_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 88.30 | 13.06 | 84.01 | 97.08 | [config](vit-base-p32_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-9cea8599.pth) | +| `vit-base-p16_32xb128-mae_in1k` | From scratch | 86.57 | 17.58 | 82.37 | 96.15 | [config](vit-base-p16_32xb128-mae_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/vit-base-p16_pt-32xb128-mae_in1k_20220623-4c544545.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vit/vit-base-p16_pt-32xb128-mae_in1k_20220623-4c544545.log) | +| `vit-base-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 86.86 | 55.54 | 85.43 | 97.77 | [config](vit-base-p16_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth) | +| `vit-large-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 304.72 | 191.21 | 85.63 | 97.63 | [config](vit-large-p16_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-large-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-b20ba619.pth) | + +*Models with * are converted from the [official repo](https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208). The config files of these models are only for inference. 
We haven't reproduce the training results.* + +## Citation + +```bibtex +@inproceedings{ + dosovitskiy2021an, + title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale}, + author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby}, + booktitle={International Conference on Learning Representations}, + year={2021}, + url={https://openreview.net/forum?id=YicbFdNTTy} +} +``` diff --git a/configs/vision_transformer/metafile.yml b/configs/vision_transformer/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..891c413ab6c5b579eb5d404b7b7e7d01fe94b8d8 --- /dev/null +++ b/configs/vision_transformer/metafile.yml @@ -0,0 +1,95 @@ +Collections: + - Name: Vision Transformer + Metadata: + Architecture: + - Attention Dropout + - Convolution + - Dense Connections + - Dropout + - GELU + - Layer Normalization + - Multi-Head Attention + - Scaled Dot-Product Attention + - Tanh Activation + Paper: + Title: 'An Image is Worth 16x16 Words: Transformers for Image Recognition at + Scale' + URL: https://arxiv.org/abs/2010.11929 + README: configs/vision_transformer/README.md + Code: + URL: https://github.com/open-mmlab/mmpretrain/blob/v0.17.0/mmcls/models/backbones/vision_transformer.py + Version: v0.17.0 + +Models: + - Name: vit-base-p32_in21k-pre_3rdparty_in1k-384px + Metadata: + FLOPs: 13056716544 + Parameters: 88297192 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: Vision Transformer + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 84.01 + Top 5 Accuracy: 97.08 + Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-9cea8599.pth + Config: configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py + Converted From: + Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/B_32-i21k-300ep-lr_0.001-aug_light1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz + Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208 + - Name: vit-base-p16_32xb128-mae_in1k + Metadata: + FLOPs: 17581972224 + Parameters: 86567656 + Training Data: + - ImageNet-1k + In Collection: Vision Transformer + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 82.37 + Top 5 Accuracy: 96.15 + Weights: https://download.openmmlab.com/mmclassification/v0/vit/vit-base-p16_pt-32xb128-mae_in1k_20220623-4c544545.pth + Config: configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py + - Name: vit-base-p16_in21k-pre_3rdparty_in1k-384px + Metadata: + FLOPs: 55538974464 + Parameters: 86859496 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: Vision Transformer + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 85.43 + Top 5 Accuracy: 97.77 + Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth + Config: configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py + Converted From: + Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_384.npz + 
Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208 + - Name: vit-large-p16_in21k-pre_3rdparty_in1k-384px + Metadata: + FLOPs: 191210034176 + Parameters: 304715752 + Training Data: + - ImageNet-21k + - ImageNet-1k + In Collection: Vision Transformer + Results: + - Dataset: ImageNet-1k + Task: Image Classification + Metrics: + Top 1 Accuracy: 85.63 + Top 5 Accuracy: 97.63 + Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-large-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-b20ba619.pth + Config: configs/vision_transformer/vit-large-p16_64xb64_in1k-384px.py + Converted From: + Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/L_16-i21k-300ep-lr_0.001-aug_strong1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz + Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208 diff --git a/configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py b/configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..a46bbb21a99b34f792f277759b4dccb75c88b2ed --- /dev/null +++ b/configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py @@ -0,0 +1,58 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py' +] + +# model settings +model = dict( + type='ImageClassifier', + backbone=dict( + type='VisionTransformer', + arch='base', + img_size=224, + patch_size=16, + drop_path_rate=0.1), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'), + ), + init_cfg=[ + dict(type='TruncNormal', layer='Linear', std=.02), + dict(type='Constant', layer='LayerNorm', val=1., bias=0.), + ], + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0) + ])) + +# dataset settings +train_dataloader = dict(batch_size=128) + +# schedule settings +optim_wrapper = dict( + optimizer=dict( + type='AdamW', + lr=1e-4 * 4096 / 256, + weight_decay=0.3, + eps=1e-8, + betas=(0.9, 0.95)), + paramwise_cfg=dict( + norm_decay_mult=0.0, + bias_decay_mult=0.0, + custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) + })) + +# runtime settings +custom_hooks = [dict(type='EMAHook', momentum=1e-4)] + +# NOTE: `auto_scale_lr` is for automatically scaling LR +# based on the actual training batch size. 
+# base_batch_size = (32 GPUs) x (128 samples per GPU) +auto_scale_lr = dict(base_batch_size=4096) diff --git a/configs/vision_transformer/vit-base-p16_4xb544-ipu_in1k.py b/configs/vision_transformer/vit-base-p16_4xb544-ipu_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..d378b3b265b30b7f3e492dcf22527fed5cd9beb4 --- /dev/null +++ b/configs/vision_transformer/vit-base-p16_4xb544-ipu_in1k.py @@ -0,0 +1,114 @@ +_base_ = [ + '../_base_/models/vit-base-p16.py', + '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py', + '../_base_/default_runtime.py' +] + +# specific to vit pretrain +paramwise_cfg = dict(custom_keys={ + '.cls_token': dict(decay_mult=0.0), + '.pos_embed': dict(decay_mult=0.0) +}) + +pretrained = 'https://download.openmmlab.com/mmclassification/v0/vit/pretrain/vit-base-p16_3rdparty_pt-64xb64_in1k-224_20210928-02284250.pth' # noqa + +model = dict( + head=dict( + loss=dict(type='CrossEntropyLoss', loss_weight=1.0, _delete_=True), ), + backbone=dict( + img_size=224, + init_cfg=dict( + type='Pretrained', + checkpoint=pretrained, + _delete_=True, + prefix='backbone'))) + +img_norm_cfg = dict( + mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5], to_rgb=True) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=224, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='Normalize', **img_norm_cfg), + dict(type='ImageToTensor', keys=['img']), + dict(type='ToTensor', keys=['gt_label']), + dict(type='ToHalf', keys=['img']), + dict(type='Collect', keys=['img', 'gt_label']) +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='Resize', scale=(224, -1), keep_ratio=True, backend='pillow'), + dict(type='CenterCrop', crop_size=224), + dict(type='Normalize', **img_norm_cfg), + dict(type='ImageToTensor', keys=['img']), + dict(type='ToHalf', keys=['img']), + dict(type='Collect', keys=['img']) +] + +# change batch size +data = dict( + samples_per_gpu=17, + workers_per_gpu=16, + drop_last=True, + train=dict(pipeline=train_pipeline), + train_dataloader=dict(mode='async'), + val=dict(pipeline=test_pipeline, ), + val_dataloader=dict(samples_per_gpu=4, workers_per_gpu=1), + test=dict(pipeline=test_pipeline), + test_dataloader=dict(samples_per_gpu=4, workers_per_gpu=1)) + +# optimizer +optimizer = dict( + type='SGD', + lr=0.08, + weight_decay=1e-5, + momentum=0.9, + paramwise_cfg=paramwise_cfg, +) + +# learning policy +param_scheduler = [ + dict(type='LinearLR', start_factor=0.02, by_epoch=False, begin=0, end=800), + dict( + type='CosineAnnealingLR', + T_max=4200, + by_epoch=False, + begin=800, + end=5000) +] + +# ipu cfg +# model partition config +ipu_model_cfg = dict( + train_split_edges=[ + dict(layer_to_call='backbone.patch_embed', ipu_id=0), + dict(layer_to_call='backbone.layers.3', ipu_id=1), + dict(layer_to_call='backbone.layers.6', ipu_id=2), + dict(layer_to_call='backbone.layers.9', ipu_id=3) + ], + train_ckpt_nodes=['backbone.layers.{}'.format(i) for i in range(12)]) + +# device config +options_cfg = dict( + randomSeed=42, + partialsType='half', + train_cfg=dict( + executionStrategy='SameAsIpu', + Training=dict(gradientAccumulation=32), + availableMemoryProportion=[0.3, 0.3, 0.3, 0.3], + ), + eval_cfg=dict(deviceIterations=1, ), +) + +# add model partition config and device config to runner +runner = dict( + type='IterBasedRunner', + ipu_model_cfg=ipu_model_cfg, + options_cfg=options_cfg, + max_iters=5000) + +default_hooks = 
dict(checkpoint=dict(type='CheckpointHook', interval=1000)) + +fp16 = dict(loss_scale=256.0, velocity_accum_type='half', accum_type='half') diff --git a/configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py b/configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..e0f745874bcef7e3896cfc694c16bf4e5a235fae --- /dev/null +++ b/configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py @@ -0,0 +1,38 @@ +_base_ = [ + '../_base_/models/vit-base-p16.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict(backbone=dict(img_size=384)) + +# dataset setting +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=384, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'), + dict(type='CenterCrop', crop_size=384), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/vision_transformer/vit-base-p16_64xb64_in1k.py b/configs/vision_transformer/vit-base-p16_64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..07be0e9a373a324f07989476314d391f2fee4f8e --- /dev/null +++ b/configs/vision_transformer/vit-base-p16_64xb64_in1k.py @@ -0,0 +1,15 @@ +_base_ = [ + '../_base_/models/vit-base-p16.py', + '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict( + head=dict(hidden_dim=3072), + train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)), +) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/vision_transformer/vit-base-p16_8xb64-lora_in1k-384px.py b/configs/vision_transformer/vit-base-p16_8xb64-lora_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..ffe1018e5d9c0f724911b782a555cb34d50d6ceb --- /dev/null +++ b/configs/vision_transformer/vit-base-p16_8xb64-lora_in1k-384px.py @@ -0,0 +1,84 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict( + type='ImageClassifier', + backbone=dict( + type='LoRAModel', + module=dict( + type='VisionTransformer', + arch='b', + img_size=384, + patch_size=16, + drop_rate=0.1, + init_cfg=dict(type='Pretrained', checkpoint='', + prefix='backbone')), + alpha=16, + rank=16, + drop_rate=0.1, + targets=[dict(type='qkv')]), + neck=None, + head=dict( + type='VisionTransformerClsHead', + num_classes=1000, + in_channels=768, + loss=dict( + type='LabelSmoothLoss', label_smooth_val=0.1, + mode='classy_vision'), + init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)], + )) + +# dataset setting +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) 
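# How the LoRAModel wrapper above behaves, assuming the standard low-rank
# adaptation (LoRA) formulation (illustrative note, not taken from this repo's
# code): the pretrained backbone weights are frozen, and each targeted
# projection (here the attention 'qkv' layer) gets a low-rank update, so the
# effective weight is roughly
#     W_eff = W + (alpha / rank) * B @ A
# with A and B small rank-16 matrices that are the only new backbone
# parameters to train, keeping the number of trainable parameters small.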
+ +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=384, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'), + dict(type='CenterCrop', crop_size=384), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +param_scheduler = [ + dict( + type='LinearLR', + start_factor=1e-4, + by_epoch=True, + begin=0, + end=5, + convert_to_iter_based=True), + dict( + type='CosineAnnealingLR', + T_max=45, + by_epoch=True, + begin=5, + end=50, + eta_min=1e-6, + convert_to_iter_based=True) +] + +train_cfg = dict(by_epoch=True, max_epochs=50) +default_hooks = dict( + # save checkpoint per epoch. + checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3)) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py b/configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..e5a4d14f4dad0759f70b9b9e29c085ad7eff292c --- /dev/null +++ b/configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py @@ -0,0 +1,38 @@ +_base_ = [ + '../_base_/models/vit-base-p32.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict(backbone=dict(img_size=384)) + +# dataset setting +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=384, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'), + dict(type='CenterCrop', crop_size=384), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/vision_transformer/vit-base-p32_64xb64_in1k.py b/configs/vision_transformer/vit-base-p32_64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..9cfc7c47df0887e4ace1bbaeb59bb5d42e004a83 --- /dev/null +++ b/configs/vision_transformer/vit-base-p32_64xb64_in1k.py @@ -0,0 +1,15 @@ +_base_ = [ + '../_base_/models/vit-base-p32.py', + '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict( + head=dict(hidden_dim=3072), + train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)), +) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/vision_transformer/vit-large-p16_64xb64_in1k-384px.py b/configs/vision_transformer/vit-large-p16_64xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..98e96ec68ffdaca2648e1ac2ae5a79db30ec8382 --- /dev/null +++ b/configs/vision_transformer/vit-large-p16_64xb64_in1k-384px.py @@ -0,0 +1,38 @@ 
+_base_ = [ + '../_base_/models/vit-large-p16.py', + '../_base_/datasets/imagenet_bs64_pil_resize.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict(backbone=dict(img_size=384)) + +# dataset setting +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=384, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'), + dict(type='CenterCrop', crop_size=384), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/vision_transformer/vit-large-p16_64xb64_in1k.py b/configs/vision_transformer/vit-large-p16_64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..0d9bd283b779af36df99574bbdde7701c6b41393 --- /dev/null +++ b/configs/vision_transformer/vit-large-p16_64xb64_in1k.py @@ -0,0 +1,15 @@ +_base_ = [ + '../_base_/models/vit-large-p16.py', + '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict( + head=dict(hidden_dim=3072), + train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)), +) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/vision_transformer/vit-large-p32_64xb64_in1k-384px.py b/configs/vision_transformer/vit-large-p32_64xb64_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..22320d119890bb80aca47e45322dabeee4d0feb7 --- /dev/null +++ b/configs/vision_transformer/vit-large-p32_64xb64_in1k-384px.py @@ -0,0 +1,38 @@ +_base_ = [ + '../_base_/models/vit-large-p32.py', + '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict(backbone=dict(img_size=384)) + +# dataset setting +data_preprocessor = dict( + mean=[127.5, 127.5, 127.5], + std=[127.5, 127.5, 127.5], + # convert image from BGR to RGB + to_rgb=True, +) + +train_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='RandomResizedCrop', scale=384, backend='pillow'), + dict(type='RandomFlip', prob=0.5, direction='horizontal'), + dict(type='PackInputs'), +] + +test_pipeline = [ + dict(type='LoadImageFromFile'), + dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'), + dict(type='CenterCrop', crop_size=384), + dict(type='PackInputs'), +] + +train_dataloader = dict(dataset=dict(pipeline=train_pipeline)) +val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) +test_dataloader = dict(dataset=dict(pipeline=test_pipeline)) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/vision_transformer/vit-large-p32_64xb64_in1k.py b/configs/vision_transformer/vit-large-p32_64xb64_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..61e179165b84d8aa521426aa992cc2460d7ae0a5 --- /dev/null +++ b/configs/vision_transformer/vit-large-p32_64xb64_in1k.py 
@@ -0,0 +1,15 @@ +_base_ = [ + '../_base_/models/vit-large-p32.py', + '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py', + '../_base_/schedules/imagenet_bs4096_AdamW.py', + '../_base_/default_runtime.py' +] + +# model setting +model = dict( + head=dict(hidden_dim=3072), + train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)), +) + +# schedule setting +optim_wrapper = dict(clip_grad=dict(max_norm=1.0)) diff --git a/configs/wrn/README.md b/configs/wrn/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2753307b06699b4235aaf1465f0ce5cf89a30952 --- /dev/null +++ b/configs/wrn/README.md @@ -0,0 +1,76 @@ +# Wide-ResNet + +> [Wide Residual Networks](https://arxiv.org/abs/1605.07146) + + + +## Abstract + +Deep residual networks were shown to be able to scale up to thousands of layers and still have improving performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease depth and increase width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior over their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet. + +
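The core idea from the abstract, making each residual block wider instead of stacking more of them, can be sketched as below. This is a toy block for illustration only (MMPreTrain builds Wide-ResNet from its ResNet backbone, as the configs below show); the class name and the widening factor `k` are assumptions of the example.

```python
import torch
import torch.nn as nn


class ToyWideBlock(nn.Module):
    """Toy residual block whose channel width is scaled by a widening factor k."""

    def __init__(self, base_channels=16, k=2):
        super().__init__()
        width = base_channels * k  # widen the block instead of adding more layers
        self.body = nn.Sequential(
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.body(x)  # identity shortcut around the widened body


x = torch.rand(1, 32, 56, 56)  # 16 base channels * k=2
print(ToyWideBlock()(x).shape)  # torch.Size([1, 32, 56, 56])
# Doubling k roughly quadruples the parameters of the 3x3 convolutions,
# trading depth for width as the paper proposes.
print(sum(p.numel() for p in ToyWideBlock(k=2).parameters()))
```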
+ +## How to use it? + + + +**Predict image** + +```python +from mmpretrain import inference_model + +predict = inference_model('wide-resnet50_3rdparty_8xb32_in1k', 'demo/bird.JPEG') +print(predict['pred_class']) +print(predict['pred_score']) +``` + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('wide-resnet50_3rdparty_8xb32_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/wrn/wide-resnet50_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty_8xb32_in1k_20220304-66678344.pth +``` + + + +## Models and results + +### Image Classification on ImageNet-1k + +| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download | +| :----------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :-----------------------------------------------------------------: | +| `wide-resnet50_3rdparty_8xb32_in1k`\* | From scratch | 68.88 | 11.44 | 78.48 | 94.08 | [config](wide-resnet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty_8xb32_in1k_20220304-66678344.pth) | +| `wide-resnet101_3rdparty_8xb32_in1k`\* | From scratch | 126.89 | 22.81 | 78.84 | 94.28 | [config](wide-resnet101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet101_3rdparty_8xb32_in1k_20220304-8d5f9d61.pth) | +| `wide-resnet50_3rdparty-timm_8xb32_in1k`\* | From scratch | 68.88 | 11.44 | 81.45 | 95.53 | [config](wide-resnet50_timm_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty-timm_8xb32_in1k_20220304-83ae4399.pth) | + +*Models with * are converted from the [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/resnet.py). The config files of these models are only for inference. 
We haven't reproduced the training results.*
dict(backbone=dict(depth=101)) diff --git a/configs/wrn/wide-resnet50_8xb32_in1k.py b/configs/wrn/wide-resnet50_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..edf6a0518ac73f4eaa54f261ecbfce8acf0f2035 --- /dev/null +++ b/configs/wrn/wide-resnet50_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/wide-resnet50.py', + '../_base_/datasets/imagenet_bs32_pil_resize.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/wrn/wide-resnet50_timm_8xb32_in1k.py b/configs/wrn/wide-resnet50_timm_8xb32_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..8dca8f37319f8d60df0e42123b2ebe16a3f7d9d8 --- /dev/null +++ b/configs/wrn/wide-resnet50_timm_8xb32_in1k.py @@ -0,0 +1,5 @@ +_base_ = [ + '../_base_/models/wide-resnet50.py', + '../_base_/datasets/imagenet_bs32_pil_bicubic.py', + '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py' +] diff --git a/configs/xcit/README.md b/configs/xcit/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ab2cd7a3634e4d877bca3d5125d3506d3861b428 --- /dev/null +++ b/configs/xcit/README.md @@ -0,0 +1,106 @@ +# XCiT + +> [XCiT: Cross-Covariance Image Transformers](https://arxiv.org/abs/2106.09681) + + + +## Abstract + +Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens ,i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k. + +
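The cross-covariance attention (XCA) described above can be sketched in a few lines: attention is computed between the `d` feature channels rather than the `N` tokens, so the attention map is `d x d` and the cost grows linearly with the number of tokens. This is a toy function for illustration, not the repository's XCiT implementation; the function name, the scalar temperature, and the shapes are assumptions of the example.

```python
import torch
import torch.nn.functional as F


def toy_xca(q, k, v, temperature=1.0):
    """Toy cross-covariance attention over (batch, heads, tokens, channels) tensors."""
    # L2-normalise each channel along the token axis, so the d x d attention
    # map below is a (temperature-scaled) cross-covariance of key/query channels.
    qn = F.normalize(q, dim=-2)
    kn = F.normalize(k, dim=-2)
    # (d x N) @ (N x d) -> d x d: linear in the number of tokens N.
    attn = (qn.transpose(-2, -1) @ kn) * temperature
    attn = attn.softmax(dim=-1)
    # Each output channel is a mixture of the value channels.
    return v @ attn.transpose(-2, -1)


q = k = v = torch.rand(1, 8, 196, 48)  # 196 tokens, 8 heads, 48 channels per head
print(toy_xca(q, k, v).shape)          # torch.Size([1, 8, 196, 48])
```

Because the map is `d x d` instead of `N x N`, the same operation stays affordable at 384px inputs, which is why the table below can include high-resolution variants.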
+ +## How to use it? + + + +**Use the model** + +```python +import torch +from mmpretrain import get_model + +model = get_model('xcit-nano-12-p16_3rdparty_in1k', pretrained=True) +inputs = torch.rand(1, 3, 224, 224) +out = model(inputs) +print(type(out)) +# To extract features. +feats = model.extract_feat(inputs) +print(type(feats)) +``` + +**Test Command** + +Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset). + +Test: + +```shell +python tools/test.py configs/xcit/xcit-nano-12-p16_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty_in1k_20230213-ed776c38.pth +``` + + + +## Models and results + +### Pretrained models + +| Model | Params (M) | Flops (G) | Config | Download | +| :---------------------------------------------- | :--------: | :-------: | :-----------------------------------------------: | :-----------------------------------------------------------------------------------: | +| `xcit-nano-12-p16_3rdparty_in1k`\* | 3.05 | 0.56 | [config](xcit-nano-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty_in1k_20230213-ed776c38.pth) | +| `xcit-nano-12-p16_3rdparty-dist_in1k`\* | 3.05 | 0.56 | [config](xcit-nano-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty-dist_in1k_20230213-fb247f7b.pth) | +| `xcit-tiny-12-p16_3rdparty_in1k`\* | 6.72 | 1.24 | [config](xcit-tiny-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty_in1k_20230213-82c547ca.pth) | +| `xcit-tiny-12-p16_3rdparty-dist_in1k`\* | 6.72 | 1.24 | [config](xcit-tiny-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty-dist_in1k_20230213-d5fde0a3.pth) | +| `xcit-nano-12-p16_3rdparty-dist_in1k-384px`\* | 3.05 | 1.64 | [config](xcit-nano-12-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty-dist_in1k-384px_20230213-712db4d4.pth) | +| `xcit-nano-12-p8_3rdparty_in1k`\* | 3.05 | 2.16 | [config](xcit-nano-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty_in1k_20230213-3370c293.pth) | +| `xcit-nano-12-p8_3rdparty-dist_in1k`\* | 3.05 | 2.16 | [config](xcit-nano-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty-dist_in1k_20230213-2f87d2b3.pth) | +| `xcit-tiny-24-p16_3rdparty_in1k`\* | 12.12 | 2.34 | [config](xcit-tiny-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty_in1k_20230213-366c1cd0.pth) | +| `xcit-tiny-24-p16_3rdparty-dist_in1k`\* | 12.12 | 2.34 | [config](xcit-tiny-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty-dist_in1k_20230213-b472e80a.pth) | +| `xcit-tiny-12-p16_3rdparty-dist_in1k-384px`\* | 6.72 | 3.64 | [config](xcit-tiny-12-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty-dist_in1k-384px_20230213-00a20023.pth) | +| `xcit-tiny-12-p8_3rdparty_in1k`\* | 6.71 | 4.81 | [config](xcit-tiny-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty_in1k_20230213-8b02f8f5.pth) | +| 
`xcit-tiny-12-p8_3rdparty-dist_in1k`\* | 6.71 | 4.81 | [config](xcit-tiny-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty-dist_in1k_20230213-f3f9b44f.pth) | +| `xcit-small-12-p16_3rdparty_in1k`\* | 26.25 | 4.81 | [config](xcit-small-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty_in1k_20230213-d36779d2.pth) | +| `xcit-small-12-p16_3rdparty-dist_in1k`\* | 26.25 | 4.81 | [config](xcit-small-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty-dist_in1k_20230213-c95bbae1.pth) | +| `xcit-nano-12-p8_3rdparty-dist_in1k-384px`\* | 3.05 | 6.34 | [config](xcit-nano-12-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty-dist_in1k-384px_20230213-09d925ef.pth) | +| `xcit-tiny-24-p16_3rdparty-dist_in1k-384px`\* | 12.12 | 6.87 | [config](xcit-tiny-24-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty-dist_in1k-384px_20230213-20e13917.pth) | +| `xcit-small-24-p16_3rdparty_in1k`\* | 47.67 | 9.10 | [config](xcit-small-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty_in1k_20230213-40febe38.pth) | +| `xcit-small-24-p16_3rdparty-dist_in1k`\* | 47.67 | 9.10 | [config](xcit-small-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty-dist_in1k_20230213-130d7262.pth) | +| `xcit-tiny-24-p8_3rdparty_in1k`\* | 12.11 | 9.21 | [config](xcit-tiny-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty_in1k_20230213-4b9ba392.pth) | +| `xcit-tiny-24-p8_3rdparty-dist_in1k`\* | 12.11 | 9.21 | [config](xcit-tiny-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty-dist_in1k_20230213-ad9c44b0.pth) | +| `xcit-tiny-12-p8_3rdparty-dist_in1k-384px`\* | 6.71 | 14.13 | [config](xcit-tiny-12-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty-dist_in1k-384px_20230213-a072174a.pth) | +| `xcit-small-12-p16_3rdparty-dist_in1k-384px`\* | 26.25 | 14.14 | [config](xcit-small-12-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty-dist_in1k-384px_20230213-ba36c982.pth) | +| `xcit-medium-24-p16_3rdparty_in1k`\* | 84.40 | 16.13 | [config](xcit-medium-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty_in1k_20230213-ad0aa92e.pth) | +| `xcit-medium-24-p16_3rdparty-dist_in1k`\* | 84.40 | 16.13 | [config](xcit-medium-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty-dist_in1k_20230213-aca5cd0c.pth) | +| `xcit-small-12-p8_3rdparty_in1k`\* | 26.21 | 18.69 | [config](xcit-small-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty_in1k_20230213-9e364ce3.pth) | +| `xcit-small-12-p8_3rdparty-dist_in1k`\* | 26.21 | 18.69 | [config](xcit-small-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty-dist_in1k_20230213-71886580.pth) | +| `xcit-small-24-p16_3rdparty-dist_in1k-384px`\* | 47.67 | 26.72 | 
[config](xcit-small-24-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty-dist_in1k-384px_20230213-28fa2d0e.pth) | +| `xcit-tiny-24-p8_3rdparty-dist_in1k-384px`\* | 12.11 | 27.05 | [config](xcit-tiny-24-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty-dist_in1k-384px_20230213-30d5e5ec.pth) | +| `xcit-small-24-p8_3rdparty_in1k`\* | 47.63 | 35.81 | [config](xcit-small-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty_in1k_20230213-280ebcc7.pth) | +| `xcit-small-24-p8_3rdparty-dist_in1k`\* | 47.63 | 35.81 | [config](xcit-small-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty-dist_in1k_20230213-f2773c78.pth) | +| `xcit-large-24-p16_3rdparty_in1k`\* | 189.10 | 35.86 | [config](xcit-large-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty_in1k_20230214-d29d2529.pth) | +| `xcit-large-24-p16_3rdparty-dist_in1k`\* | 189.10 | 35.86 | [config](xcit-large-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty-dist_in1k_20230214-4fea599c.pth) | +| `xcit-medium-24-p16_3rdparty-dist_in1k-384px`\* | 84.40 | 47.39 | [config](xcit-medium-24-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty-dist_in1k-384px_20230214-6c23a201.pth) | +| `xcit-small-12-p8_3rdparty-dist_in1k-384px`\* | 26.21 | 54.92 | [config](xcit-small-12-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty-dist_in1k-384px_20230214-9f2178bc.pth) | +| `xcit-medium-24-p8_3rdparty_in1k`\* | 84.32 | 63.52 | [config](xcit-medium-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty_in1k_20230214-c362850b.pth) | +| `xcit-medium-24-p8_3rdparty-dist_in1k`\* | 84.32 | 63.52 | [config](xcit-medium-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty-dist_in1k_20230214-625c953b.pth) | +| `xcit-small-24-p8_3rdparty-dist_in1k-384px`\* | 47.63 | 105.24 | [config](xcit-small-24-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty-dist_in1k-384px_20230214-57298eca.pth) | +| `xcit-large-24-p16_3rdparty-dist_in1k-384px`\* | 189.10 | 105.35 | [config](xcit-large-24-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty-dist_in1k-384px_20230214-bd515a34.pth) | +| `xcit-large-24-p8_3rdparty_in1k`\* | 188.93 | 141.23 | [config](xcit-large-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty_in1k_20230214-08f2f664.pth) | +| `xcit-large-24-p8_3rdparty-dist_in1k`\* | 188.93 | 141.23 | [config](xcit-large-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty-dist_in1k_20230214-8c092b34.pth) | +| `xcit-medium-24-p8_3rdparty-dist_in1k-384px`\* | 84.32 | 186.67 | [config](xcit-medium-24-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty-dist_in1k-384px_20230214-5db925e0.pth) | +| `xcit-large-24-p8_3rdparty-dist_in1k-384px`\* | 
188.93 | 415.00 | [config](xcit-large-24-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty-dist_in1k-384px_20230214-9f718b1a.pth) | + +*Models with * are converted from the [official repo](https://github.com/facebookresearch/xcit). The config files of these models are only for inference. We haven't reproduce the training results.* + +## Citation + +```bibtex +@article{el2021xcit, + title={XCiT: Cross-Covariance Image Transformers}, + author={El-Nouby, Alaaeldin and Touvron, Hugo and Caron, Mathilde and Bojanowski, Piotr and Douze, Matthijs and Joulin, Armand and Laptev, Ivan and Neverova, Natalia and Synnaeve, Gabriel and Verbeek, Jakob and others}, + journal={arXiv preprint arXiv:2106.09681}, + year={2021} +} +``` diff --git a/configs/xcit/metafile.yml b/configs/xcit/metafile.yml new file mode 100644 index 0000000000000000000000000000000000000000..8379da1927cae6a45433351ca0b930b54f0e9ba7 --- /dev/null +++ b/configs/xcit/metafile.yml @@ -0,0 +1,727 @@ +Collections: + - Name: XCiT + Metadata: + Architecture: + - Class Attention + - Local Patch Interaction + - Cross-Covariance Attention + Paper: + Title: 'XCiT: Cross-Covariance Image Transformers' + URL: https://arxiv.org/abs/2106.09681 + README: configs/xcit/README.md + +Models: + - Name: xcit-nano-12-p16_3rdparty_in1k + Metadata: + FLOPs: 557074560 + Parameters: 3053224 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 70.35 + Top 5 Accuracy: 89.98 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty_in1k_20230213-ed776c38.pth + Config: configs/xcit/xcit-nano-12-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p16_224.pth + - Name: xcit-nano-12-p16_3rdparty-dist_in1k + Metadata: + FLOPs: 557074560 + Parameters: 3053224 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 72.36 + Top 5 Accuracy: 91.02 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty-dist_in1k_20230213-fb247f7b.pth + Config: configs/xcit/xcit-nano-12-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p16_224_dist.pth + - Name: xcit-tiny-12-p16_3rdparty_in1k + Metadata: + FLOPs: 1239698112 + Parameters: 6716272 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.21 + Top 5 Accuracy: 93.62 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty_in1k_20230213-82c547ca.pth + Config: configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p16_224.pth + - Name: xcit-tiny-12-p16_3rdparty-dist_in1k + Metadata: + FLOPs: 1239698112 + Parameters: 6716272 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 78.7 + Top 5 Accuracy: 94.12 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty-dist_in1k_20230213-d5fde0a3.pth + Config: configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py + Converted From: + 
Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p16_224_dist.pth + - Name: xcit-nano-12-p16_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 1636347520 + Parameters: 3053224 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 74.93 + Top 5 Accuracy: 92.42 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty-dist_in1k-384px_20230213-712db4d4.pth + Config: configs/xcit/xcit-nano-12-p16_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p16_384_dist.pth + - Name: xcit-nano-12-p8_3rdparty_in1k + Metadata: + FLOPs: 2156861056 + Parameters: 3049016 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 73.8 + Top 5 Accuracy: 92.08 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty_in1k_20230213-3370c293.pth + Config: configs/xcit/xcit-nano-12-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p8_224.pth + - Name: xcit-nano-12-p8_3rdparty-dist_in1k + Metadata: + FLOPs: 2156861056 + Parameters: 3049016 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 76.17 + Top 5 Accuracy: 93.08 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty-dist_in1k_20230213-2f87d2b3.pth + Config: configs/xcit/xcit-nano-12-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p8_224_dist.pth + - Name: xcit-tiny-24-p16_3rdparty_in1k + Metadata: + FLOPs: 2339305152 + Parameters: 12116896 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.47 + Top 5 Accuracy: 94.85 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty_in1k_20230213-366c1cd0.pth + Config: configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p16_224.pth + - Name: xcit-tiny-24-p16_3rdparty-dist_in1k + Metadata: + FLOPs: 2339305152 + Parameters: 12116896 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.51 + Top 5 Accuracy: 95.17 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty-dist_in1k_20230213-b472e80a.pth + Config: configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p16_224_dist.pth + - Name: xcit-tiny-12-p16_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 3641468352 + Parameters: 6716272 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 80.58 + Top 5 Accuracy: 95.38 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty-dist_in1k-384px_20230213-00a20023.pth + Config: 
configs/xcit/xcit-tiny-12-p16_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p16_384_dist.pth + - Name: xcit-tiny-12-p8_3rdparty_in1k + Metadata: + FLOPs: 4807399872 + Parameters: 6706504 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 79.75 + Top 5 Accuracy: 94.88 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty_in1k_20230213-8b02f8f5.pth + Config: configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p8_224.pth + - Name: xcit-tiny-12-p8_3rdparty-dist_in1k + Metadata: + FLOPs: 4807399872 + Parameters: 6706504 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.26 + Top 5 Accuracy: 95.46 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty-dist_in1k_20230213-f3f9b44f.pth + Config: configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p8_224_dist.pth + - Name: xcit-small-12-p16_3rdparty_in1k + Metadata: + FLOPs: 4814951808 + Parameters: 26253304 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.87 + Top 5 Accuracy: 95.77 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty_in1k_20230213-d36779d2.pth + Config: configs/xcit/xcit-small-12-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p16_224.pth + - Name: xcit-small-12-p16_3rdparty-dist_in1k + Metadata: + FLOPs: 4814951808 + Parameters: 26253304 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.12 + Top 5 Accuracy: 96.41 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty-dist_in1k_20230213-c95bbae1.pth + Config: configs/xcit/xcit-small-12-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p16_224_dist.pth + - Name: xcit-nano-12-p8_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 6337760896 + Parameters: 3049016 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 77.69 + Top 5 Accuracy: 94.09 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty-dist_in1k-384px_20230213-09d925ef.pth + Config: configs/xcit/xcit-nano-12-p8_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p8_384_dist.pth + - Name: xcit-tiny-24-p16_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 6872966592 + Parameters: 12116896 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.43 + Top 5 Accuracy: 96.2 + Task: Image Classification + Weights: 
https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty-dist_in1k-384px_20230213-20e13917.pth + Config: configs/xcit/xcit-tiny-24-p16_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p16_384_dist.pth + - Name: xcit-small-24-p16_3rdparty_in1k + Metadata: + FLOPs: 9095064960 + Parameters: 47671384 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.38 + Top 5 Accuracy: 95.93 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty_in1k_20230213-40febe38.pth + Config: configs/xcit/xcit-small-24-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p16_224.pth + - Name: xcit-small-24-p16_3rdparty-dist_in1k + Metadata: + FLOPs: 9095064960 + Parameters: 47671384 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.7 + Top 5 Accuracy: 96.61 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty-dist_in1k_20230213-130d7262.pth + Config: configs/xcit/xcit-small-24-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p16_224_dist.pth + - Name: xcit-tiny-24-p8_3rdparty_in1k + Metadata: + FLOPs: 9205828032 + Parameters: 12107128 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 81.7 + Top 5 Accuracy: 95.9 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty_in1k_20230213-4b9ba392.pth + Config: configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p8_224.pth + - Name: xcit-tiny-24-p8_3rdparty-dist_in1k + Metadata: + FLOPs: 9205828032 + Parameters: 12107128 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.62 + Top 5 Accuracy: 96.16 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty-dist_in1k_20230213-ad9c44b0.pth + Config: configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p8_224_dist.pth + - Name: xcit-tiny-12-p8_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 14126142912 + Parameters: 6706504 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.46 + Top 5 Accuracy: 96.22 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty-dist_in1k-384px_20230213-a072174a.pth + Config: configs/xcit/xcit-tiny-12-p8_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p8_384_dist.pth + - Name: xcit-small-12-p16_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 14143179648 + Parameters: 26253304 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.74 + Top 5 Accuracy: 
97.19 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty-dist_in1k-384px_20230213-ba36c982.pth + Config: configs/xcit/xcit-small-12-p16_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p16_384_dist.pth + - Name: xcit-medium-24-p16_3rdparty_in1k + Metadata: + FLOPs: 16129561088 + Parameters: 84395752 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.56 + Top 5 Accuracy: 95.82 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty_in1k_20230213-ad0aa92e.pth + Config: configs/xcit/xcit-medium-24-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p16_224.pth + - Name: xcit-medium-24-p16_3rdparty-dist_in1k + Metadata: + FLOPs: 16129561088 + Parameters: 84395752 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.15 + Top 5 Accuracy: 96.82 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty-dist_in1k_20230213-aca5cd0c.pth + Config: configs/xcit/xcit-medium-24-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p16_224_dist.pth + - Name: xcit-small-12-p8_3rdparty_in1k + Metadata: + FLOPs: 18691601280 + Parameters: 26213032 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.21 + Top 5 Accuracy: 96.41 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty_in1k_20230213-9e364ce3.pth + Config: configs/xcit/xcit-small-12-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p8_224.pth + - Name: xcit-small-12-p8_3rdparty-dist_in1k + Metadata: + FLOPs: 18691601280 + Parameters: 26213032 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.97 + Top 5 Accuracy: 96.81 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty-dist_in1k_20230213-71886580.pth + Config: configs/xcit/xcit-small-12-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p8_224_dist.pth + - Name: xcit-small-24-p16_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 26721471360 + Parameters: 47671384 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.1 + Top 5 Accuracy: 97.32 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty-dist_in1k-384px_20230213-28fa2d0e.pth + Config: configs/xcit/xcit-small-24-p16_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p16_384_dist.pth + - Name: xcit-tiny-24-p8_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 27052135872 + Parameters: 12107128 + Training Data: ImageNet-1k + In Collection: XCiT + Results: 
+ - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.77 + Top 5 Accuracy: 96.72 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty-dist_in1k-384px_20230213-30d5e5ec.pth + Config: configs/xcit/xcit-tiny-24-p8_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p8_384_dist.pth + - Name: xcit-small-24-p8_3rdparty_in1k + Metadata: + FLOPs: 35812053888 + Parameters: 47631112 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.62 + Top 5 Accuracy: 96.51 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty_in1k_20230213-280ebcc7.pth + Config: configs/xcit/xcit-small-24-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p8_224.pth + - Name: xcit-small-24-p8_3rdparty-dist_in1k + Metadata: + FLOPs: 35812053888 + Parameters: 47631112 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.68 + Top 5 Accuracy: 97.07 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty-dist_in1k_20230213-f2773c78.pth + Config: configs/xcit/xcit-small-24-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p8_224_dist.pth + - Name: xcit-large-24-p16_3rdparty_in1k + Metadata: + FLOPs: 35855948544 + Parameters: 189096136 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 82.97 + Top 5 Accuracy: 95.86 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty_in1k_20230214-d29d2529.pth + Config: configs/xcit/xcit-large-24-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p16_224.pth + - Name: xcit-large-24-p16_3rdparty-dist_in1k + Metadata: + FLOPs: 35855948544 + Parameters: 189096136 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.61 + Top 5 Accuracy: 97.07 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty-dist_in1k_20230214-4fea599c.pth + Config: configs/xcit/xcit-large-24-p16_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p16_224_dist.pth + - Name: xcit-medium-24-p16_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 47388932608 + Parameters: 84395752 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.47 + Top 5 Accuracy: 97.49 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty-dist_in1k-384px_20230214-6c23a201.pth + Config: configs/xcit/xcit-medium-24-p16_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p16_384_dist.pth + - Name: xcit-small-12-p8_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 54923537280 + 
Parameters: 26213032 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.12 + Top 5 Accuracy: 97.31 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty-dist_in1k-384px_20230214-9f2178bc.pth + Config: configs/xcit/xcit-small-12-p8_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p8_384_dist.pth + - Name: xcit-medium-24-p8_3rdparty_in1k + Metadata: + FLOPs: 63524706816 + Parameters: 84323624 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 83.61 + Top 5 Accuracy: 96.23 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty_in1k_20230214-c362850b.pth + Config: configs/xcit/xcit-medium-24-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p8_224.pth + - Name: xcit-medium-24-p8_3rdparty-dist_in1k + Metadata: + FLOPs: 63524706816 + Parameters: 84323624 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.0 + Top 5 Accuracy: 97.16 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty-dist_in1k_20230214-625c953b.pth + Config: configs/xcit/xcit-medium-24-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p8_224_dist.pth + - Name: xcit-small-24-p8_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 105236704128 + Parameters: 47631112 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.57 + Top 5 Accuracy: 97.6 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty-dist_in1k-384px_20230214-57298eca.pth + Config: configs/xcit/xcit-small-24-p8_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p8_384_dist.pth + - Name: xcit-large-24-p16_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 105345095424 + Parameters: 189096136 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.78 + Top 5 Accuracy: 97.6 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty-dist_in1k-384px_20230214-bd515a34.pth + Config: configs/xcit/xcit-large-24-p16_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p16_384_dist.pth + - Name: xcit-large-24-p8_3rdparty_in1k + Metadata: + FLOPs: 141225699072 + Parameters: 188932648 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 84.23 + Top 5 Accuracy: 96.58 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty_in1k_20230214-08f2f664.pth + Config: configs/xcit/xcit-large-24-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: 
https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p8_224.pth + - Name: xcit-large-24-p8_3rdparty-dist_in1k + Metadata: + FLOPs: 141225699072 + Parameters: 188932648 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.14 + Top 5 Accuracy: 97.32 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty-dist_in1k_20230214-8c092b34.pth + Config: configs/xcit/xcit-large-24-p8_8xb128_in1k.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p8_224_dist.pth + - Name: xcit-medium-24-p8_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 186672626176 + Parameters: 84323624 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 85.87 + Top 5 Accuracy: 97.61 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty-dist_in1k-384px_20230214-5db925e0.pth + Config: configs/xcit/xcit-medium-24-p8_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p8_384_dist.pth + - Name: xcit-large-24-p8_3rdparty-dist_in1k-384px + Metadata: + FLOPs: 415003137792 + Parameters: 188932648 + Training Data: ImageNet-1k + In Collection: XCiT + Results: + - Dataset: ImageNet-1k + Metrics: + Top 1 Accuracy: 86.13 + Top 5 Accuracy: 97.75 + Task: Image Classification + Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty-dist_in1k-384px_20230214-9f718b1a.pth + Config: configs/xcit/xcit-large-24-p8_8xb128_in1k-384px.py + Converted From: + Code: https://github.com/facebookresearch/xcit + Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p8_384_dist.pth diff --git a/configs/xcit/xcit-large-24-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-large-24-p16_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..b393c4aea03ab1927e11773609562cd323963931 --- /dev/null +++ b/configs/xcit/xcit-large-24-p16_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=768, + depth=24, + num_heads=16, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-large-24-p16_8xb128_in1k.py b/configs/xcit/xcit-large-24-p16_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b5c01cb5f72e93ad8b5e81d363b3c3f914504f64 --- /dev/null +++ b/configs/xcit/xcit-large-24-p16_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=768, + depth=24, + num_heads=16, + mlp_ratio=4, + 
qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-large-24-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-large-24-p8_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..46b8422b481e69100266798a2183cae56d6e345e --- /dev/null +++ b/configs/xcit/xcit-large-24-p8_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=768, + depth=24, + num_heads=16, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-large-24-p8_8xb128_in1k.py b/configs/xcit/xcit-large-24-p8_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..6dc67baa59b9e270b2c06bb0a928879ef8f78f60 --- /dev/null +++ b/configs/xcit/xcit-large-24-p8_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=768, + depth=24, + num_heads=16, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=768, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-medium-24-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-medium-24-p16_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..8c91b9cd6e9511a8dbbae437a5454d35eb4c03e0 --- /dev/null +++ b/configs/xcit/xcit-medium-24-p16_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=512, + depth=24, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-medium-24-p16_8xb128_in1k.py 
b/configs/xcit/xcit-medium-24-p16_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..148ed0640da548877cbf04c67bfc0bbb3351dfce --- /dev/null +++ b/configs/xcit/xcit-medium-24-p16_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=512, + depth=24, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-medium-24-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-medium-24-p8_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..3138ec4f0b41456d99e2d59d60575327e794f10e --- /dev/null +++ b/configs/xcit/xcit-medium-24-p8_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=512, + depth=24, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-medium-24-p8_8xb128_in1k.py b/configs/xcit/xcit-medium-24-p8_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..b8277a10b772aa3c7a39ace2051829c8818df987 --- /dev/null +++ b/configs/xcit/xcit-medium-24-p8_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=512, + depth=24, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=512, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-nano-12-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-nano-12-p16_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..bf8c27b3b1acee69892fa83a8be40da82b62fd44 --- /dev/null +++ b/configs/xcit/xcit-nano-12-p16_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + 
embed_dims=128, + depth=12, + num_heads=4, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1.0, + tokens_norm=False, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=128, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-nano-12-p16_8xb128_in1k.py b/configs/xcit/xcit-nano-12-p16_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..3e9bf81c5f4639ee5c7ba57c9ef996c79076df65 --- /dev/null +++ b/configs/xcit/xcit-nano-12-p16_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=128, + depth=12, + num_heads=4, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1.0, + tokens_norm=False, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=128, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-nano-12-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-nano-12-p8_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..7dae69f0b3b9a2ea8792f0beed8e0ee68f0cc4e9 --- /dev/null +++ b/configs/xcit/xcit-nano-12-p8_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=128, + depth=12, + num_heads=4, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1.0, + tokens_norm=False, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=128, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-nano-12-p8_8xb128_in1k.py b/configs/xcit/xcit-nano-12-p8_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..e6a003a30ef7348f29732ca1c36210704e886c1c --- /dev/null +++ b/configs/xcit/xcit-nano-12-p8_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=128, + depth=12, + num_heads=4, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1.0, + tokens_norm=False, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=128, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-small-12-p16_8xb128_in1k-384px.py 
b/configs/xcit/xcit-small-12-p16_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..54c80d498e0c1370f1122ee34ef1970a521796a7 --- /dev/null +++ b/configs/xcit/xcit-small-12-p16_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=384, + depth=12, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1.0, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-small-12-p16_8xb128_in1k.py b/configs/xcit/xcit-small-12-p16_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..c546179f42f7a0a668d3d7f8d27ae137006577ae --- /dev/null +++ b/configs/xcit/xcit-small-12-p16_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=384, + depth=12, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1.0, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-small-12-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-small-12-p8_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..f1b6a52c370578f9fe9420521d1bc494563071e6 --- /dev/null +++ b/configs/xcit/xcit-small-12-p8_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=384, + depth=12, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1.0, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-small-12-p8_8xb128_in1k.py b/configs/xcit/xcit-small-12-p8_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..cbfbe151781fb012fae2099bb0a9b9bd5d7e563e --- /dev/null +++ b/configs/xcit/xcit-small-12-p8_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=384, + depth=12, 
+ num_heads=8, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1.0, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-small-24-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-small-24-p16_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..6eb41275b83939e2ac71f5e6e15fa2a8bf5f4df2 --- /dev/null +++ b/configs/xcit/xcit-small-24-p16_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=384, + depth=24, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-small-24-p16_8xb128_in1k.py b/configs/xcit/xcit-small-24-p16_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..5b3dc18f438ffb49bde71a795e24abf36c427e14 --- /dev/null +++ b/configs/xcit/xcit-small-24-p16_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=384, + depth=24, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-small-24-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-small-24-p8_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..34445a09d637c222a25aa608de2f99bf1dacedb1 --- /dev/null +++ b/configs/xcit/xcit-small-24-p8_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=384, + depth=24, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-small-24-p8_8xb128_in1k.py 
b/configs/xcit/xcit-small-24-p8_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..108e64d41ae0c34c17bc5e6a5baa6d46eb6a9d08 --- /dev/null +++ b/configs/xcit/xcit-small-24-p8_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=384, + depth=24, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=384, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-tiny-12-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-tiny-12-p16_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..b64ebe497082ef6f9c4b93ad16e7343f66008e07 --- /dev/null +++ b/configs/xcit/xcit-tiny-12-p16_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=192, + depth=12, + num_heads=4, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1.0, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=192, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py b/configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..1b54592f88bad986e885129bbce9d585fb864206 --- /dev/null +++ b/configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=192, + depth=12, + num_heads=4, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1.0, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=192, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-tiny-12-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-tiny-12-p8_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..f1acff7ead898fb45c8ab6eac5aa3ed3dd13d939 --- /dev/null +++ b/configs/xcit/xcit-tiny-12-p8_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=192, + depth=12, + 
num_heads=4, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1.0, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=192, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py b/configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..39d97da21689382d0e6b168fd78f9a74b269e8c1 --- /dev/null +++ b/configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=192, + depth=12, + num_heads=4, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1.0, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=192, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-tiny-24-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-tiny-24-p16_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..556043565e2e844f77a2a2b62e7ebe71d638590d --- /dev/null +++ b/configs/xcit/xcit-tiny-24-p16_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=192, + depth=24, + num_heads=4, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=192, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py b/configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..fdceb14323ac89a12d529f7112806fef7e6f9d66 --- /dev/null +++ b/configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=16, + embed_dims=192, + depth=24, + num_heads=4, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=192, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-tiny-24-p8_8xb128_in1k-384px.py 
b/configs/xcit/xcit-tiny-24-p8_8xb128_in1k-384px.py new file mode 100644 index 0000000000000000000000000000000000000000..2cee442e5b77481550d479c4f83cb2e9a80e46ae --- /dev/null +++ b/configs/xcit/xcit-tiny-24-p8_8xb128_in1k-384px.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_384.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=192, + depth=24, + num_heads=4, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=192, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py b/configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py new file mode 100644 index 0000000000000000000000000000000000000000..283f17e61708e9d19e5af09c57d8a937cec2e854 --- /dev/null +++ b/configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py @@ -0,0 +1,34 @@ +_base_ = [ + '../_base_/datasets/imagenet_bs64_swin_224.py', + '../_base_/schedules/imagenet_bs1024_adamw_swin.py', + '../_base_/default_runtime.py', +] + +model = dict( + type='ImageClassifier', + backbone=dict( + type='XCiT', + patch_size=8, + embed_dims=192, + depth=24, + num_heads=4, + mlp_ratio=4, + qkv_bias=True, + layer_scale_init_value=1e-5, + tokens_norm=True, + out_type='cls_token', + ), + head=dict( + type='LinearClsHead', + num_classes=1000, + in_channels=192, + loss=dict(type='CrossEntropyLoss', loss_weight=1.0), + ), + train_cfg=dict(augments=[ + dict(type='Mixup', alpha=0.8), + dict(type='CutMix', alpha=1.0), + ]), +) + +# dataset settings +train_dataloader = dict(batch_size=128) diff --git a/dataset-index.yml b/dataset-index.yml new file mode 100644 index 0000000000000000000000000000000000000000..40ca62069295695d896134b60e66b2260066c072 --- /dev/null +++ b/dataset-index.yml @@ -0,0 +1,11 @@ +imagenet1k: + dataset: OpenDataLab/ImageNet-1K + download_root: data + data_root: data/imagenet + script: tools/dataset_converters/odl_imagenet1k_preprocess.sh + +cub: + dataset: OpenDataLab/CUB-200-2011 + download_root: data + data_root: data/CUB_200_2011 + script: tools/dataset_converters/odl_cub_preprocess.sh diff --git a/model-index.yml b/model-index.yml new file mode 100644 index 0000000000000000000000000000000000000000..1bd928533e1be02db1dea58dbf6c52b2bde45f45 --- /dev/null +++ b/model-index.yml @@ -0,0 +1,85 @@ +Import: + - configs/mobilenet_v2/metafile.yml + - configs/mobilenet_v3/metafile.yml + - configs/resnet/metafile.yml + - configs/res2net/metafile.yml + - configs/resnext/metafile.yml + - configs/seresnet/metafile.yml + - configs/shufflenet_v1/metafile.yml + - configs/shufflenet_v2/metafile.yml + - configs/swin_transformer/metafile.yml + - configs/vgg/metafile.yml + - configs/repvgg/metafile.yml + - configs/tnt/metafile.yml + - configs/vision_transformer/metafile.yml + - configs/t2t_vit/metafile.yml + - configs/tinyvit/metafile.yml + - configs/mlp_mixer/metafile.yml + - configs/conformer/metafile.yml + - configs/regnet/metafile.yml + - configs/deit/metafile.yml + - configs/twins/metafile.yml + - configs/efficientnet/metafile.yml + - configs/convnext/metafile.yml + - configs/hrnet/metafile.yml + - configs/repmlp/metafile.yml + - 
configs/wrn/metafile.yml + - configs/van/metafile.yml + - configs/cspnet/metafile.yml + - configs/convmixer/metafile.yml + - configs/densenet/metafile.yml + - configs/poolformer/metafile.yml + - configs/inception_v3/metafile.yml + - configs/mvit/metafile.yml + - configs/edgenext/metafile.yml + - configs/mobileone/metafile.yml + - configs/efficientformer/metafile.yml + - configs/swin_transformer_v2/metafile.yml + - configs/deit3/metafile.yml + - configs/hornet/metafile.yml + - configs/mobilevit/metafile.yml + - configs/davit/metafile.yml + - configs/replknet/metafile.yml + - configs/csra/metafile.yml + - configs/beit/metafile.yml + - configs/beitv2/metafile.yml + - configs/eva/metafile.yml + - configs/revvit/metafile.yml + - configs/clip/metafile.yml + - configs/mixmim/metafile.yml + - configs/efficientnet_v2/metafile.yml + - configs/convnext_v2/metafile.yml + - configs/levit/metafile.yml + - configs/vig/metafile.yml + - configs/arcface/metafile.yml + - configs/xcit/metafile.yml + - configs/byol/metafile.yml + - configs/densecl/metafile.yml + - configs/mocov2/metafile.yml + - configs/mocov3/metafile.yml + - configs/simclr/metafile.yml + - configs/simsiam/metafile.yml + - configs/swav/metafile.yml + - configs/mae/metafile.yml + - configs/simmim/metafile.yml + - configs/barlowtwins/metafile.yml + - configs/cae/metafile.yml + - configs/maskfeat/metafile.yml + - configs/milan/metafile.yml + - configs/ofa/metafile.yml + - configs/riformer/metafile.yml + - configs/sam/metafile.yml + - configs/glip/metafile.yml + - configs/eva02/metafile.yml + - configs/dinov2/metafile.yml + - configs/blip/metafile.yml + - configs/flamingo/metafile.yml + - configs/blip2/metafile.yml + - configs/chinese_clip/metafile.yml + - configs/itpn/metafile.yml + - configs/hivit/metafile.yml + - configs/spark/metafile.yml + - configs/minigpt4/metafile.yml + - configs/llava/metafile.yml + - configs/otter/metafile.yml + - configs/mff/metafile.yml diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..6da5adea757ffc79ac35e544d4afe85c5f44a90d --- /dev/null +++ b/requirements.txt @@ -0,0 +1,3 @@ +-r requirements/optional.txt +-r requirements/runtime.txt +-r requirements/tests.txt diff --git a/requirements/docs.txt b/requirements/docs.txt new file mode 100644 index 0000000000000000000000000000000000000000..208d8ac0add9f5a2603cf45a549d986c7d2ce2ce --- /dev/null +++ b/requirements/docs.txt @@ -0,0 +1,10 @@ +docutils==0.18.1 +modelindex +myst-parser +git+https://github.com/mzr1996/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme +sphinx==6.1.3 +sphinx-copybutton +sphinx-notfound-page +sphinx-tabs +sphinxcontrib-jquery +tabulate diff --git a/requirements/mminstall.txt b/requirements/mminstall.txt new file mode 100644 index 0000000000000000000000000000000000000000..9b736b028bdc9d8b5f8f53121f9e41234b4ba424 --- /dev/null +++ b/requirements/mminstall.txt @@ -0,0 +1,2 @@ +mmcv>=2.0.0,<2.4.0 +mmengine>=0.8.3,<1.0.0 diff --git a/requirements/multimodal.txt b/requirements/multimodal.txt new file mode 100644 index 0000000000000000000000000000000000000000..f6150b16d8e5b6deab4e4b34bec25d5ceeb6bf1a --- /dev/null +++ b/requirements/multimodal.txt @@ -0,0 +1,2 @@ +pycocotools +transformers>=4.28.0 diff --git a/requirements/optional.txt b/requirements/optional.txt new file mode 100644 index 0000000000000000000000000000000000000000..5f31808f14b9259e18fcd2d2b056b0c611b09131 --- /dev/null +++ b/requirements/optional.txt @@ -0,0 +1,4 @@ +albumentations>=0.3.2 --no-binary 
qudida,albumentations # For Albumentations data transform +grad-cam >= 1.3.7,<1.5.0 # For CAM visualization +requests # For torchserve +scikit-learn # For t-SNE visualization and unit tests. diff --git a/requirements/readthedocs.txt b/requirements/readthedocs.txt new file mode 100644 index 0000000000000000000000000000000000000000..145cedab5b84aa9bf0900bd5361627251c6337e8 --- /dev/null +++ b/requirements/readthedocs.txt @@ -0,0 +1,7 @@ +--extra-index-url https://download.pytorch.org/whl/cpu +mmcv-lite>=2.0.0rc4 +mmengine +pycocotools +torch +torchvision +transformers diff --git a/requirements/runtime.txt b/requirements/runtime.txt new file mode 100644 index 0000000000000000000000000000000000000000..e0b0d903f3d77635cd475e531f8142ead65e3b06 --- /dev/null +++ b/requirements/runtime.txt @@ -0,0 +1,7 @@ +einops +importlib-metadata +mat4py +matplotlib +modelindex +numpy +rich diff --git a/requirements/tests.txt b/requirements/tests.txt new file mode 100644 index 0000000000000000000000000000000000000000..ed0110fe120fe49910012e046c0afd29973bf509 --- /dev/null +++ b/requirements/tests.txt @@ -0,0 +1,3 @@ +coverage +interrogate +pytest diff --git a/setup.cfg b/setup.cfg new file mode 100644 index 0000000000000000000000000000000000000000..06455344af48c02c611ac95b6f84d76d1de3ec46 --- /dev/null +++ b/setup.cfg @@ -0,0 +1,33 @@ +[bdist_wheel] +universal=1 + +[aliases] +test=pytest + +[yapf] +based_on_style = pep8 +blank_line_before_nested_class_or_def = true +split_before_expression_after_opening_paren = true + +[isort] +line_length = 79 +multi_line_output = 0 +extra_standard_library = pkg_resources,setuptools +known_first_party = mmpretrain +no_lines_before = STDLIB,LOCALFOLDER +default_section = THIRDPARTY + +[codespell] +skip = *.ipynb +quiet-level = 3 +ignore-words-list = patten,confectionary,nd,ty,formating,dows + +[flake8] +# The E251 check is conflict with yapf in some situation. +# See https://github.com/google/yapf/issues/393 +extend-ignore = E251 +# The F401 check is wrong if the `__all__` variable is modified +# in `__init__.py` +per-file-ignores = + */__init__.py: F401 + mmpretrain/configs/*: F401,F403,F405 diff --git a/setup.py b/setup.py new file mode 100644 index 0000000000000000000000000000000000000000..e68dff2be8d37d84b98cf1face47a3568e8d0068 --- /dev/null +++ b/setup.py @@ -0,0 +1,198 @@ +import os +import os.path as osp +import shutil +import sys +import warnings +from setuptools import find_packages, setup + + +def readme(): + with open('README.md', encoding='utf-8') as f: + content = f.read() + return content + + +def get_version(): + version_file = 'mmpretrain/version.py' + with open(version_file, 'r', encoding='utf-8') as f: + exec(compile(f.read(), version_file, 'exec')) + return locals()['__version__'] + + +def parse_requirements(fname='requirements.txt', with_version=True): + """Parse the package dependencies listed in a requirements file but strips + specific versioning information. 
+ + Args: + fname (str): path to requirements file + with_version (bool, default=True): if True include version specs + + Returns: + List[str]: list of requirements items + + CommandLine: + python -c "import setup; print(setup.parse_requirements())" + """ + import re + import sys + from os.path import exists + require_fpath = fname + + def parse_line(line): + """Parse information from a line in a requirements text file.""" + if line.startswith('-r '): + # Allow specifying requirements in other files + target = line.split(' ')[1] + for info in parse_require_file(target): + yield info + else: + info = {'line': line} + if line.startswith('-e '): + info['package'] = line.split('#egg=')[1] + else: + # Remove versioning from the package + pat = '(' + '|'.join(['>=', '==', '>']) + ')' + parts = re.split(pat, line, maxsplit=1) + parts = [p.strip() for p in parts] + + info['package'] = parts[0] + if len(parts) > 1: + op, rest = parts[1:] + if ';' in rest: + # Handle platform specific dependencies + # http://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-platform-specific-dependencies + version, platform_deps = map(str.strip, + rest.split(';')) + info['platform_deps'] = platform_deps + else: + version = rest # NOQA + if '--' in version: + # the `extras_require` doesn't accept options. + version = version.split('--')[0].strip() + info['version'] = (op, version) + yield info + + def parse_require_file(fpath): + with open(fpath, 'r') as f: + for line in f.readlines(): + line = line.strip() + if line and not line.startswith('#'): + for info in parse_line(line): + yield info + + def gen_packages_items(): + if exists(require_fpath): + for info in parse_require_file(require_fpath): + parts = [info['package']] + if with_version and 'version' in info: + parts.extend(info['version']) + if not sys.version.startswith('3.4'): + # apparently package_deps are broken in 3.4 + platform_deps = info.get('platform_deps') + if platform_deps is not None: + parts.append(';' + platform_deps) + item = ''.join(parts) + yield item + + packages = list(gen_packages_items()) + return packages + + +def add_mim_extension(): + """Add extra files that are required to support MIM into the package. + + These files will be added by creating a symlink to the originals if the + package is installed in `editable` mode (e.g. pip install -e .), or by + copying from the originals otherwise. + """ + + # parse installment mode + if 'develop' in sys.argv: + # installed by `pip install -e .` + mode = 'symlink' + elif 'sdist' in sys.argv or 'bdist_wheel' in sys.argv: + # installed by `pip install .` + # or create source distribution by `python setup.py sdist` + mode = 'copy' + else: + return + + filenames = ['tools', 'configs', 'model-index.yml', 'dataset-index.yml'] + repo_path = osp.dirname(__file__) + mim_path = osp.join(repo_path, 'mmpretrain', '.mim') + os.makedirs(mim_path, exist_ok=True) + + for filename in filenames: + if osp.exists(filename): + src_path = osp.join(repo_path, filename) + tar_path = osp.join(mim_path, filename) + + if osp.isfile(tar_path) or osp.islink(tar_path): + os.remove(tar_path) + elif osp.isdir(tar_path): + shutil.rmtree(tar_path) + + if mode == 'symlink': + src_relpath = osp.relpath(src_path, osp.dirname(tar_path)) + try: + os.symlink(src_relpath, tar_path) + except OSError: + # Creating a symbolic link on windows may raise an + # `OSError: [WinError 1314]` due to privilege.
If + # the error happens, the src file will be copied + mode = 'copy' + warnings.warn( + f'Failed to create a symbolic link for {src_relpath}, ' + f'and it will be copied to {tar_path}') + else: + continue + + if mode == 'copy': + if osp.isfile(src_path): + shutil.copyfile(src_path, tar_path) + elif osp.isdir(src_path): + shutil.copytree(src_path, tar_path) + else: + warnings.warn(f'Cannot copy file {src_path}.') + else: + raise ValueError(f'Invalid mode {mode}') + + +if __name__ == '__main__': + add_mim_extension() + setup( + name='mmpretrain', + version=get_version(), + description='OpenMMLab Model Pretraining Toolbox and Benchmark', + long_description=readme(), + long_description_content_type='text/markdown', + keywords='computer vision, image classification, ' + 'unsupervised learning, self-supervised learning', + packages=find_packages(exclude=('configs', 'tools', 'demo', 'tests')), + include_package_data=True, + python_requires='>=3.7', + classifiers=[ + 'Development Status :: 4 - Beta', + 'License :: OSI Approved :: Apache Software License', + 'Operating System :: OS Independent', + 'Programming Language :: Python :: 3', + 'Programming Language :: Python :: 3.7', + 'Programming Language :: Python :: 3.8', + 'Programming Language :: Python :: 3.9', + 'Programming Language :: Python :: 3.10', + 'Programming Language :: Python :: 3.11', + 'Topic :: Scientific/Engineering :: Artificial Intelligence', + ], + url='https://github.com/open-mmlab/mmpretrain', + author='MMPretrain Contributors', + author_email='openmmlab@gmail.com', + license='Apache License 2.0', + install_requires=parse_requirements('requirements/runtime.txt'), + extras_require={ + 'all': parse_requirements('requirements.txt'), + 'tests': parse_requirements('requirements/tests.txt'), + 'optional': parse_requirements('requirements/optional.txt'), + 'mim': parse_requirements('requirements/mminstall.txt'), + 'multimodal': parse_requirements('requirements/multimodal.txt'), + }, + zip_safe=False)
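A quick orientation on how the files added in this diff fit together: `configs/xcit/metafile.yml` registers each XCiT variant by name and links it to a config file and a checkpoint URL, `model-index.yml` imports that metafile so the models become discoverable, and `add_mim_extension` in `setup.py` bundles `tools/`, `configs/`, `model-index.yml` and `dataset-index.yml` into the installed package. The sketch below shows one plausible way to consume these entries through MMPreTrain's high-level Python API after an editable install (`pip install -e .`); the helper functions (`list_models`, `get_model`, `inference_model`), the chosen model name, and the demo image path come from the package's documented interface and the metafile above rather than from this diff, so treat them as assumptions.

```python
# Minimal usage sketch, assuming `pip install -e .` has been run from the
# repository root and that mmpretrain exposes its documented high-level API.
from mmpretrain import get_model, inference_model, list_models

# Model names correspond to the `Name` fields in configs/xcit/metafile.yml.
print(list_models('xcit*'))

# Build the classifier described by configs/xcit/xcit-nano-12-p16_8xb128_in1k.py
# and load the converted third-party checkpoint referenced by the metafile.
model = get_model('xcit-nano-12-p16_3rdparty_in1k', pretrained=True)

# Single-image classification; 'demo/demo.JPEG' is a placeholder image path.
result = inference_model('xcit-nano-12-p16_3rdparty_in1k', 'demo/demo.JPEG')
print(result['pred_class'], result['pred_score'])
```

Training or evaluating one of the added configs would typically go through the repository's standard entry points, for example `python tools/train.py configs/xcit/xcit-nano-12-p16_8xb128_in1k.py`; those scripts are not part of this diff.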