diff --git a/CITATION.cff b/CITATION.cff
new file mode 100644
index 0000000000000000000000000000000000000000..81ea8f792645b1904e792918590eb215c62dd323
--- /dev/null
+++ b/CITATION.cff
@@ -0,0 +1,9 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+title: "OpenMMLab's Pre-training Toolbox and Benchmark"
+authors:
+ - name: "MMPreTrain Contributors"
+version: 0.15.0
+date-released: 2023-04-06
+repository-code: "https://github.com/open-mmlab/mmpretrain"
+license: Apache-2.0
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 0000000000000000000000000000000000000000..ce84c2a09f59785d3220a722b8ba1282c97a8030
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,73 @@
+# Contributing to MMPreTrain
+
+- [Contributing to MMPreTrain](#contributing-to-mmpretrain)
+ - [Workflow](#workflow)
+ - [Code style](#code-style)
+ - [Python](#python)
+ - [C++ and CUDA](#c-and-cuda)
+ - [Pre-commit Hook](#pre-commit-hook)
+
+Thanks for your interest in contributing to MMPreTrain! All kinds of contributions are welcome, including but not limited to the following.
+
+- Fix typos or bugs
+- Add documentation or translate the documentation into other languages
+- Add new features and components
+
+## Workflow
+
+We recommend that potential contributors follow this workflow.
+
+1. Fork and pull the latest MMPreTrain repository, then follow [get started](https://mmpretrain.readthedocs.io/en/latest/get_started.html) to set up the environment.
+2. Check out a new branch (**do not use the master or dev branch** for PRs)
+
+```bash
+git checkout -b xxxx  # xxxx is the name of the new branch
+```
+
+3. Edit the related files following the code style mentioned below
+4. Use the [pre-commit hook](https://pre-commit.com/) to check and format your changes (see the example commands after this list)
+5. Commit your changes
+6. Create a PR with related information
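+
+For reference, steps 4-6 could look like the following on the command line (assuming the pre-commit hook described below is installed; the file path, commit message, and branch name `xxxx` are only examples):
+
+```bash
+# run the pre-commit hooks on the files you changed
+pre-commit run --files path/to/changed_file.py
+
+# commit the changes and push the branch created in step 2
+git add path/to/changed_file.py
+git commit -m "docs: fix a typo in the README"
+git push origin xxxx
+```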
+
+## Code style
+
+### Python
+
+We adopt [PEP8](https://www.python.org/dev/peps/pep-0008/) as the preferred code style.
+
+We use the following tools for linting and formatting:
+
+- [flake8](https://github.com/PyCQA/flake8): A wrapper around some linter tools.
+- [isort](https://github.com/timothycrosley/isort): A Python utility to sort imports.
+- [yapf](https://github.com/google/yapf): A formatter for Python files.
+- [codespell](https://github.com/codespell-project/codespell): A Python utility to fix common misspellings in text files.
+- [mdformat](https://github.com/executablebooks/mdformat): Mdformat is an opinionated Markdown formatter that can be used to enforce a consistent style in Markdown files.
+- [docformatter](https://github.com/myint/docformatter): A formatter to format docstrings.
+
+Style configurations of yapf and isort can be found in [setup.cfg](https://github.com/open-mmlab/mmpretrain/blob/main/setup.cfg).
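+
+If you want to run these tools manually (the pre-commit hook described below also runs them for you), the usual invocations are as follows; running them on the `mmpretrain` package directory is just an example:
+
+```shell
+flake8 mmpretrain       # lint
+isort mmpretrain        # sort imports in place
+yapf -r -i mmpretrain   # format Python files recursively, in place
+```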
+
+### C++ and CUDA
+
+We follow the [Google C++ Style Guide](https://google.github.io/styleguide/cppguide.html).
+
+## Pre-commit Hook
+
+We use a [pre-commit hook](https://pre-commit.com/) that, on every commit, checks and formats the code with `flake8`, `yapf` and `isort`, checks trailing whitespace and Markdown files,
+fixes `end-of-files`, `double-quoted-strings`, `python-encoding-pragma` and `mixed-line-ending`, and sorts `requirements.txt` automatically.
+The config for the pre-commit hook is stored in [.pre-commit-config](https://github.com/open-mmlab/mmpretrain/blob/main/.pre-commit-config.yaml).
+
+After you clone the repository, you will need to install and initialize the pre-commit hook.
+
+```shell
+pip install -U pre-commit
+```
+
+Then, from the repository folder, run:
+
+```shell
+pre-commit install
+```
+
+After this, the code linters and formatters will be enforced on every commit.
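+
+You can also run all hooks against the whole repository at any time, for example before opening a PR:
+
+```shell
+pre-commit run --all-files
+```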
+
+> Before you create a PR, make sure that your code lints and is formatted by yapf.
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..ae87343779455c4c4b43e10a27d1657142666726
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,203 @@
+Copyright (c) OpenMMLab. All rights reserved
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright 2020 MMPreTrain Authors.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
diff --git a/MANIFEST.in b/MANIFEST.in
new file mode 100644
index 0000000000000000000000000000000000000000..ad4d8dafbdeb31327429c94430a8338e5f024acb
--- /dev/null
+++ b/MANIFEST.in
@@ -0,0 +1,5 @@
+include requirements/*.txt
+include mmpretrain/.mim/model-index.yml
+include mmpretrain/.mim/dataset-index.yml
+recursive-include mmpretrain/.mim/configs *.py *.yml
+recursive-include mmpretrain/.mim/tools *.py *.sh
diff --git a/README.md b/README.md
index 4301dc733a12bb83bbaac7e18b645284677db06c..5318df5b958b8f54dcba1896776eebfb04ba9871 100644
--- a/README.md
+++ b/README.md
@@ -1,123 +1,339 @@
-# Mobilenetv2
+
+
+

+
+
+
+
+[PyPI](https://pypi.org/project/mmpretrain)
+[Docs](https://mmpretrain.readthedocs.io/en/latest/)
+[Build Status](https://github.com/open-mmlab/mmpretrain/actions)
+[Coverage](https://codecov.io/gh/open-mmlab/mmpretrain)
+[License](https://github.com/open-mmlab/mmpretrain/blob/main/LICENSE)
+[Issues](https://github.com/open-mmlab/mmpretrain/issues)
+
+[📘 Documentation](https://mmpretrain.readthedocs.io/en/latest/) |
+[🛠️ Installation](https://mmpretrain.readthedocs.io/en/latest/get_started.html#installation) |
+[👀 Model Zoo](https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html) |
+[🆕 Update News](https://mmpretrain.readthedocs.io/en/latest/notes/changelog.html) |
+[🤔 Reporting Issues](https://github.com/open-mmlab/mmpretrain/issues/new/choose)
+
+

+
+English | [简体中文](/README_zh-CN.md)
+
+
+
+
+
+
+
+ 
+

+
+ 
+

+
+ 
+

+
+ 
+

+
+ 
+

+
+ 
+
+
+## Introduction
+
+MMPreTrain is an open source pre-training toolbox based on PyTorch. It is a part of the [OpenMMLab](https://openmmlab.com/) project.
+
+The `main` branch works with **PyTorch 1.8+**.
+
+### Major features
+
+- Various backbones and pretrained models
+- Rich training strategies (supervised learning, self-supervised learning, multi-modality learning etc.)
+- Bag of training tricks
+- Large-scale training configs
+- High efficiency and extensibility
+- Powerful toolkits for model analysis and experiments
+- Various out-of-the-box inference tasks:
+ - Image Classification
+ - Image Caption
+ - Visual Question Answering
+ - Visual Grounding
+ - Retrieval (Image-To-Image, Text-To-Image, Image-To-Text)
+
+https://github.com/open-mmlab/mmpretrain/assets/26739999/e4dcd3a2-f895-4d1b-a351-fbc74a04e904
+
+## What's new
+
+🌟 v1.2.0 was released in 04/01/2024
+
+- Support LLaVA 1.5.
+- Implement RAM with a Gradio inference interface.
+
+🌟 v1.1.0 was released in 12/10/2023
+
+- Support Mini-GPT4 training and provide a Chinese model (based on Baichuan-7B)
+- Support zero-shot classification based on CLIP.
+
+🌟 v1.0.0 was released in 04/07/2023
+
+- Support inference of more **multi-modal** algorithms, such as [**LLaVA**](./configs/llava/), [**MiniGPT-4**](./configs/minigpt4), [**Otter**](./configs/otter/), etc.
+- Support around **10 multi-modal** datasets!
+- Add [**iTPN**](./configs/itpn/), [**SparK**](./configs/spark/) self-supervised learning algorithms.
+- Provide examples of [New Config](./mmpretrain/configs/) and [DeepSpeed/FSDP with FlexibleRunner](./configs/mae/benchmarks/). Here are the documentation links of [New Config](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#a-pure-python-style-configuration-file-beta) and [DeepSpeed/FSDP with FlexibleRunner](https://mmengine.readthedocs.io/en/latest/api/generated/mmengine.runner.FlexibleRunner.html#mmengine.runner.FlexibleRunner).
+
+🌟 Upgrade from MMClassification to MMPreTrain
+
+- Integrated self-supervised learning algorithms from **MMSelfSup**, such as **MAE**, **BEiT**, etc.
+- Support **RIFormer**, a simple but effective vision backbone that removes the token mixer.
+- Refactor dataset pipeline visualization.
+- Support **LeViT**, **XCiT**, **ViG**, **ConvNeXt-V2**, **EVA**, **RevViT**, **EfficientNetV2**, **CLIP**, **TinyViT** and **MixMIM** backbones.
+
+This release introduced a brand new and flexible training & test engine, which is still a work in progress. You are welcome
+to try it by following [the documentation](https://mmpretrain.readthedocs.io/en/latest/).
+
+And there are some BC-breaking changes. Please check [the migration tutorial](https://mmpretrain.readthedocs.io/en/latest/migration.html).
-## 论文
+Please refer to [changelog](https://mmpretrain.readthedocs.io/en/latest/notes/changelog.html) for more details and other release history.
-MobileNetV2: Inverted Residuals and Linear Bottlenecks
+## Installation
-- https://openaccess.thecvf.com/content_cvpr_2018/papers/Sandler_MobileNetV2_Inverted_Residuals_CVPR_2018_paper.pdf
+Below are quick steps for installation:
-## 模型结构
-
-MobileNetV2是一种轻量级的卷积神经网络模型,由Google在2018年提出。它是MobileNet系列中的第二个版本,主要用于移动设备和嵌入式设备等资源受限的环境中进行图像分类、目标检测等计算机视觉任务。
-
-
-
-
-
-## 算法原理
-
-MobileNetV2的网络结构主要由两部分组成:特征提取层和分类器。
-
-
-
-## 环境配置
-
-### Docker(方法一)
-
-```python
-git clone --recursive http://developer.hpccube.com/codes/modelzoo/mobilenetv2_mmcv.git
-docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.10.0-centos7.6-dtk-22.10.1-py37-latest
-# 用以上拉取的docker的镜像ID替换
-docker run --shm-size 10g --network=host --name=mobilenetv2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $PWD/mobilenetv2_mmcv:/home/mobilenetv2_mmcv -it bash
-
-cd mobilenetv2_mmcv/mmclassification-mmcv
-pip install -r requirements.txt
+```shell
+conda create -n open-mmlab python=3.8 pytorch==1.10.1 torchvision==0.11.2 cudatoolkit=11.3 -c pytorch -y
+conda activate open-mmlab
+pip install openmim
+git clone https://github.com/open-mmlab/mmpretrain.git
+cd mmpretrain
+mim install -e .
```
-### Dockerfile(方法二)
-
-```plaintext
-cd mobilenetv2_mmcv/docker
-docker build --no-cache -t mobilenetv2_mmcv:latest .
-docker run --rm --shm-size 10g --network=host --name=mobilenetv2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $PWD/../../mobilenetv2_mmcv:/home/mobilenetv2_mmcv -it bash
-# 若遇到Dockerfile启动的方式安装环境需要长时间等待,可注释掉里面的pip安装,启动容器后再安装python库:pip install -r requirements.txt
-```
+Please refer to [installation documentation](https://mmpretrain.readthedocs.io/en/latest/get_started.html) for more detailed installation and dataset preparation.
-### Anaconda(方法三)
+For multi-modality model support, please install the extra dependencies with:
-1、关于本项目DCU显卡所需的特殊深度学习库可从光合开发者社区下载安装: https://developer.hpccube.com/tool/
-
-```plaintext
-DTK驱动:dtk22.10.1
-python:python3.7
-torch:1.10.0
-torchvision:0.10.0
-mmcv:1.6.1
-Tips:以上dtk驱动、python、torch等DCU相关工具版本需要严格一一对应
+```shell
+mim install -e ".[multimodal]"
```
-2、其它非特殊库参照requirements.txt安装
-
-```plaintext
-pip install -r requirements.txt
-```
-
-## 数据集
-
-在本测试中可以使用ImageNet数据集。
-
-下载ImageNet数据集:https://image-net.org/
-
-下载val数据:链接:https://pan.baidu.com/s/1oXsmsYahGVG3uOZ8e535LA?pwd=c3bc 提取码:c3bc 替换ImageNet数据集中的val目录,处理后的数据结构如下:
-
-```
-data
- ├──imagenet
- ├── meta
- ├──val.txt
- ├──train.txt
- ...
- ├── train
- ├── val
-
+## User Guides
+
+We provide a series of tutorials on the basic usage of MMPreTrain for new users:
+
+- [Learn about Configs](https://mmpretrain.readthedocs.io/en/latest/user_guides/config.html)
+- [Prepare Dataset](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html)
+- [Inference with existing models](https://mmpretrain.readthedocs.io/en/latest/user_guides/inference.html)
+- [Train](https://mmpretrain.readthedocs.io/en/latest/user_guides/train.html)
+- [Test](https://mmpretrain.readthedocs.io/en/latest/user_guides/test.html)
+- [Downstream tasks](https://mmpretrain.readthedocs.io/en/latest/user_guides/downstream.html)
+
+For more information, please refer to [our documentation](https://mmpretrain.readthedocs.io/en/latest/).
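+
+As a quick reference, single-GPU training and testing with an existing config usually look like the following; the config file and checkpoint path below are examples:
+
+```shell
+# train a model with one of the configs in this repository
+python tools/train.py configs/resnet/resnet18_8xb32_in1k.py
+
+# evaluate a trained checkpoint with the same config
+python tools/test.py configs/resnet/resnet18_8xb32_in1k.py work_dirs/resnet18_8xb32_in1k/epoch_100.pth
+```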
+
+## Model zoo
+
+Results and models are available in the [model zoo](https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html).
+
+**Overview**: supported backbones, self-supervised learning algorithms, multi-modality algorithms, and others (image retrieval task, training & test tips).
+
+## Contributing
+
+We appreciate all contributions to improve MMPreTrain.
+Please refer to [CONTRIBUTING](https://mmpretrain.readthedocs.io/en/latest/notes/contribution_guide.html) for the contributing guideline.
+
+## Acknowledgement
+
+MMPreTrain is an open source project contributed by researchers and engineers from various colleges and companies. We appreciate all the contributors who implement their methods or add new features, as well as users who give valuable feedback.
+We wish that the toolbox and benchmark could serve the growing research community by providing a flexible toolkit to reimplement existing methods and develop their own new methods.
+
+## Citation
+
+If you find this project useful in your research, please consider citing:
+
+```BibTeX
+@misc{2023mmpretrain,
+ title={OpenMMLab's Pre-training Toolbox and Benchmark},
+ author={MMPreTrain Contributors},
+ howpublished = {\url{https://github.com/open-mmlab/mmpretrain}},
+ year={2023}
+}
```
-SCNet快速下载链接[http://113.200.138.88:18080/aidatasets/project-dependency/imagenet-2012
-](http://113.200.138.88:18080/aidatasets/project-dependency/imagenet-2012
-)
-## 训练
-
-将训练数据解压到data目录下。
-
-### 单机8卡
-
- ./mobilenetv2.sh
-
-## result
-
-
-
-### 精度
-
-测试数据使用的是ImageNet数据集,使用的加速卡是DCU Z100L。
-
-| 卡数 | 精度 |
-| :--: | :-----------------------: |
-| 8 | top1:0.71764;top5:0.90386 |
-
-## 应用场景
-
-### 算法类别
-
-图像分类
-
-### 热点行业
-
-制造,能源,交通,网安
-
-## 源码仓库及问题反馈
-
-https://developer.hpccube.com/codes/modelzoo/mobilenetv2_mmcv
-
-## 参考资料
-https://github.com/open-mmlab/mmpretrain
+## License
+
+This project is released under the [Apache 2.0 license](LICENSE).
+
+## Projects in OpenMMLab
+
+- [MMEngine](https://github.com/open-mmlab/mmengine): OpenMMLab foundational library for training deep learning models.
+- [MMCV](https://github.com/open-mmlab/mmcv): OpenMMLab foundational library for computer vision.
+- [MIM](https://github.com/open-mmlab/mim): MIM installs OpenMMLab packages.
+- [MMEval](https://github.com/open-mmlab/mmeval): A unified evaluation library for multiple machine learning libraries.
+- [MMPreTrain](https://github.com/open-mmlab/mmpretrain): OpenMMLab pre-training toolbox and benchmark.
+- [MMDetection](https://github.com/open-mmlab/mmdetection): OpenMMLab detection toolbox and benchmark.
+- [MMDetection3D](https://github.com/open-mmlab/mmdetection3d): OpenMMLab's next-generation platform for general 3D object detection.
+- [MMRotate](https://github.com/open-mmlab/mmrotate): OpenMMLab rotated object detection toolbox and benchmark.
+- [MMYOLO](https://github.com/open-mmlab/mmyolo): OpenMMLab YOLO series toolbox and benchmark.
+- [MMSegmentation](https://github.com/open-mmlab/mmsegmentation): OpenMMLab semantic segmentation toolbox and benchmark.
+- [MMOCR](https://github.com/open-mmlab/mmocr): OpenMMLab text detection, recognition, and understanding toolbox.
+- [MMPose](https://github.com/open-mmlab/mmpose): OpenMMLab pose estimation toolbox and benchmark.
+- [MMHuman3D](https://github.com/open-mmlab/mmhuman3d): OpenMMLab 3D human parametric model toolbox and benchmark.
+- [MMSelfSup](https://github.com/open-mmlab/mmselfsup): OpenMMLab self-supervised learning toolbox and benchmark.
+- [MMRazor](https://github.com/open-mmlab/mmrazor): OpenMMLab model compression toolbox and benchmark.
+- [MMFewShot](https://github.com/open-mmlab/mmfewshot): OpenMMLab fewshot learning toolbox and benchmark.
+- [MMAction2](https://github.com/open-mmlab/mmaction2): OpenMMLab's next-generation action understanding toolbox and benchmark.
+- [MMTracking](https://github.com/open-mmlab/mmtracking): OpenMMLab video perception toolbox and benchmark.
+- [MMFlow](https://github.com/open-mmlab/mmflow): OpenMMLab optical flow toolbox and benchmark.
+- [MMagic](https://github.com/open-mmlab/mmagic): Open**MM**Lab **A**dvanced, **G**enerative and **I**ntelligent **C**reation toolbox.
+- [MMGeneration](https://github.com/open-mmlab/mmgeneration): OpenMMLab image and video generative models toolbox.
+- [MMDeploy](https://github.com/open-mmlab/mmdeploy): OpenMMLab model deployment framework.
+- [Playground](https://github.com/open-mmlab/playground): A central hub for gathering and showcasing amazing projects built upon OpenMMLab.
diff --git a/README_zh-CN.md b/README_zh-CN.md
new file mode 100644
index 0000000000000000000000000000000000000000..9ee8dffc401d414c0c2b7135ba2a4887f80608a4
--- /dev/null
+++ b/README_zh-CN.md
@@ -0,0 +1,353 @@
+
+
+

+
+
+
+
+[PyPI](https://pypi.org/project/mmpretrain)
+[Docs](https://mmpretrain.readthedocs.io/zh_CN/latest/)
+[Build Status](https://github.com/open-mmlab/mmpretrain/actions)
+[Coverage](https://codecov.io/gh/open-mmlab/mmpretrain)
+[License](https://github.com/open-mmlab/mmpretrain/blob/main/LICENSE)
+[Issues](https://github.com/open-mmlab/mmpretrain/issues)
+
+[📘 中文文档](https://mmpretrain.readthedocs.io/zh_CN/latest/) |
+[🛠️ 安装教程](https://mmpretrain.readthedocs.io/zh_CN/latest/get_started.html) |
+[👀 模型库](https://mmpretrain.readthedocs.io/zh_CN/latest/modelzoo_statistics.html) |
+[🆕 更新日志](https://mmpretrain.readthedocs.io/zh_CN/latest/notes/changelog.html) |
+[🤔 报告问题](https://github.com/open-mmlab/mmpretrain/issues/new/choose)
+
+

+
+[English](/README.md) | 简体中文
+
+
+
+
+
+ 
+

+
+ 
+

+
+ 
+

+
+ 
+

+
+ 
+

+
+ 
+
+
+## Introduction
+
+MMPreTrain 是一款基于 PyTorch 的开源深度学习预训练工具箱,是 [OpenMMLab](https://openmmlab.com/) 项目的成员之一
+
+`主分支`代码目前支持 PyTorch 1.8 以上的版本。
+
+### 主要特性
+
+- 支持多样的主干网络与预训练模型
+- 支持多种训练策略(有监督学习,无监督学习,多模态学习等)
+- 提供多种训练技巧
+- 大量的训练配置文件
+- 高效率和高可扩展性
+- 功能强大的工具箱,有助于模型分析和实验
+- 支持多种开箱即用的推理任务
+ - 图像分类
+ - 图像描述(Image Caption)
+ - 视觉问答(Visual Question Answering)
+ - 视觉定位(Visual Grounding)
+ - 检索(图搜图,图搜文,文搜图)
+
+https://github.com/open-mmlab/mmpretrain/assets/26739999/e4dcd3a2-f895-4d1b-a351-fbc74a04e904
+
+## 更新日志
+
+🌟 2024/01/04 发布了 v1.2.0 版本
+
+- 支持了 LLaVA 1.5
+- 实现了一个 RAM 模型的 gradio 推理例程
+
+🌟 2023/10/12 发布了 v1.1.0 版本
+
+- 支持 Mini-GPT4 训练并提供一个基于 Baichuan-7B 的中文模型
+- 支持基于 CLIP 的零样本分类。
+
+🌟 2023/7/4 发布了 v1.0.0 版本
+
+- 支持更多**多模态**算法的推理, 例如 [**LLaVA**](./configs/llava/), [**MiniGPT-4**](./configs/minigpt4), [**Otter**](./configs/otter/) 等。
+- 支持约 **10 个多模态**数据集!
+- 添加自监督学习算法 [**iTPN**](./configs/itpn/), [**SparK**](./configs/spark/)。
+- 提供[新配置文件](./mmpretrain/configs/)和 [DeepSpeed/FSDP](./configs/mae/benchmarks/) 的样例。这是[新配置文件](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#a-pure-python-style-configuration-file-beta) 和 [DeepSpeed/FSDP with FlexibleRunner](https://mmengine.readthedocs.io/en/latest/api/generated/mmengine.runner.FlexibleRunner.html#mmengine.runner.FlexibleRunner) 的文档链接。
+
+🌟 从 MMClassification 升级到 MMPreTrain
+
+- 整合来自 MMSelfSup 的自监督学习算法,例如 `MAE`, `BEiT` 等
+- 支持了 **RIFormer**,简单但有效的视觉主干网络,却移除了 token mixer
+- 重构数据管道可视化
+- 支持了 **LeViT**, **XCiT**, **ViG**, **ConvNeXt-V2**, **EVA**, **RevViT**, **EfficientnetV2**, **CLIP**, **TinyViT** 和 **MixMIM** 等骨干网络结构
+
+这个版本引入一个全新的,可扩展性强的训练和测试引擎,但目前仍在开发中。欢迎根据 [文档](https://mmpretrain.readthedocs.io/zh_CN/latest/) 进行试用。
+
+同时,新版本中存在一些与旧版本不兼容的修改。请查看 [迁移文档](https://mmpretrain.readthedocs.io/zh_CN/latest/migration.html) 来详细了解这些变动。
+
+发布历史和更新细节请参考 [更新日志](https://mmpretrain.readthedocs.io/zh_CN/latest/notes/changelog.html)。
+
+## 安装
+
+以下是安装的简要步骤:
+
+```shell
+conda create -n open-mmlab python=3.8 pytorch==1.10.1 torchvision==0.11.2 cudatoolkit=11.3 -c pytorch -y
+conda activate open-mmlab
+pip3 install openmim
+git clone https://github.com/open-mmlab/mmpretrain.git
+cd mmpretrain
+mim install -e .
+```
+
+更详细的步骤请参考 [安装指南](https://mmpretrain.readthedocs.io/zh_CN/latest/get_started.html) 进行安装。
+
+如果需要多模态模型,请使用如下方式安装额外的依赖:
+
+```shell
+mim install -e ".[multimodal]"
+```
+
+## 基础教程
+
+我们为新用户提供了一系列基础教程:
+
+- [学习配置文件](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/config.html)
+- [准备数据集](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/dataset_prepare.html)
+- [使用现有模型推理](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/inference.html)
+- [训练](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/train.html)
+- [测试](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/test.html)
+- [下游任务](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/downstream.html)
+
+关于更多的信息,请查阅我们的 [相关文档](https://mmpretrain.readthedocs.io/zh_CN/latest/)。
+
+## 模型库
+
+相关结果和模型可在 [模型库](https://mmpretrain.readthedocs.io/zh_CN/latest/modelzoo_statistics.html) 中获得。
+
+**概览**:支持的主干网络、自监督学习算法、多模态算法,以及其它(图像检索任务、训练和测试 Tips)。
+
+## 参与贡献
+
+我们非常欢迎任何有助于提升 MMPreTrain 的贡献,请参考 [贡献指南](https://mmpretrain.readthedocs.io/zh_CN/latest/notes/contribution_guide.html) 来了解如何参与贡献。
+
+## 致谢
+
+MMPreTrain 是一款由不同学校和公司共同贡献的开源项目。我们感谢所有为项目提供算法复现和新功能支持的贡献者,以及提供宝贵反馈的用户。
+我们希望该工具箱和基准测试可以为社区提供灵活的代码工具,供用户复现现有算法并开发自己的新模型,从而不断为开源社区提供贡献。
+
+## 引用
+
+如果你在研究中使用了本项目的代码或者性能基准,请参考如下 bibtex 引用 MMPreTrain。
+
+```BibTeX
+@misc{2023mmpretrain,
+ title={OpenMMLab's Pre-training Toolbox and Benchmark},
+ author={MMPreTrain Contributors},
+ howpublished = {\url{https://github.com/open-mmlab/mmpretrain}},
+ year={2023}
+}
+```
+
+## 许可证
+
+该项目开源自 [Apache 2.0 license](LICENSE).
+
+## OpenMMLab 的其他项目
+
+- [MMEngine](https://github.com/open-mmlab/mmengine): OpenMMLab 深度学习模型训练基础库
+- [MMCV](https://github.com/open-mmlab/mmcv): OpenMMLab 计算机视觉基础库
+- [MIM](https://github.com/open-mmlab/mim): MIM 是 OpenMMlab 项目、算法、模型的统一入口
+- [MMEval](https://github.com/open-mmlab/mmeval): 统一开放的跨框架算法评测库
+- [MMPreTrain](https://github.com/open-mmlab/mmpretrain): OpenMMLab 深度学习预训练工具箱
+- [MMDetection](https://github.com/open-mmlab/mmdetection): OpenMMLab 目标检测工具箱
+- [MMDetection3D](https://github.com/open-mmlab/mmdetection3d): OpenMMLab 新一代通用 3D 目标检测平台
+- [MMRotate](https://github.com/open-mmlab/mmrotate): OpenMMLab 旋转框检测工具箱与测试基准
+- [MMYOLO](https://github.com/open-mmlab/mmyolo): OpenMMLab YOLO 系列工具箱与测试基准
+- [MMSegmentation](https://github.com/open-mmlab/mmsegmentation): OpenMMLab 语义分割工具箱
+- [MMOCR](https://github.com/open-mmlab/mmocr): OpenMMLab 全流程文字检测识别理解工具包
+- [MMPose](https://github.com/open-mmlab/mmpose): OpenMMLab 姿态估计工具箱
+- [MMHuman3D](https://github.com/open-mmlab/mmhuman3d): OpenMMLab 人体参数化模型工具箱与测试基准
+- [MMSelfSup](https://github.com/open-mmlab/mmselfsup): OpenMMLab 自监督学习工具箱与测试基准
+- [MMRazor](https://github.com/open-mmlab/mmrazor): OpenMMLab 模型压缩工具箱与测试基准
+- [MMFewShot](https://github.com/open-mmlab/mmfewshot): OpenMMLab 少样本学习工具箱与测试基准
+- [MMAction2](https://github.com/open-mmlab/mmaction2): OpenMMLab 新一代视频理解工具箱
+- [MMTracking](https://github.com/open-mmlab/mmtracking): OpenMMLab 一体化视频目标感知平台
+- [MMFlow](https://github.com/open-mmlab/mmflow): OpenMMLab 光流估计工具箱与测试基准
+- [MMagic](https://github.com/open-mmlab/mmagic): OpenMMLab 新一代人工智能内容生成(AIGC)工具箱
+- [MMGeneration](https://github.com/open-mmlab/mmgeneration): OpenMMLab 图片视频生成模型工具箱
+- [MMDeploy](https://github.com/open-mmlab/mmdeploy): OpenMMLab 模型部署框架
+- [Playground](https://github.com/open-mmlab/playground): 收集和展示 OpenMMLab 相关的前沿、有趣的社区项目
+
+## 欢迎加入 OpenMMLab 社区
+
+扫描下方的二维码可关注 OpenMMLab 团队的 [知乎官方账号](https://www.zhihu.com/people/openmmlab),扫描下方微信二维码添加喵喵好友,进入 MMPretrain 微信交流社群。【加好友申请格式:研究方向+地区+学校/公司+姓名】
+
+
+

+
+
+我们会在 OpenMMLab 社区为大家
+
+- 📢 分享 AI 框架的前沿核心技术
+- 💻 解读 PyTorch 常用模块源码
+- 📰 发布 OpenMMLab 的相关新闻
+- 🚀 介绍 OpenMMLab 开发的前沿算法
+- 🏃 获取更高效的问题答疑和意见反馈
+- 🔥 提供与各行各业开发者充分交流的平台
+
+干货满满 📘,等你来撩 💗,OpenMMLab 社区期待您的加入 👬
diff --git a/configs/_base_/datasets/cifar100_bs16.py b/configs/_base_/datasets/cifar100_bs16.py
new file mode 100644
index 0000000000000000000000000000000000000000..67477db0367fa1356c4514a46f4b43d56b4c5822
--- /dev/null
+++ b/configs/_base_/datasets/cifar100_bs16.py
@@ -0,0 +1,45 @@
+# dataset settings
+dataset_type = 'CIFAR100'
+data_preprocessor = dict(
+ num_classes=100,
+ # RGB format normalization parameters
+ mean=[129.304, 124.070, 112.434],
+ std=[68.170, 65.392, 70.418],
+ # loaded images are already RGB format
+ to_rgb=False)
+
+train_pipeline = [
+ dict(type='RandomCrop', crop_size=32, padding=4),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/cifar100',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/cifar100/',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, ))
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/cifar10_bs16.py b/configs/_base_/datasets/cifar10_bs16.py
new file mode 100644
index 0000000000000000000000000000000000000000..408be35da845a39bf7058eb9c3ce5549295b3822
--- /dev/null
+++ b/configs/_base_/datasets/cifar10_bs16.py
@@ -0,0 +1,45 @@
+# dataset settings
+dataset_type = 'CIFAR10'
+data_preprocessor = dict(
+ num_classes=10,
+ # RGB format normalization parameters
+ mean=[125.307, 122.961, 113.8575],
+ std=[51.5865, 50.847, 51.255],
+ # loaded images are already RGB format
+ to_rgb=False)
+
+train_pipeline = [
+ dict(type='RandomCrop', crop_size=32, padding=4),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/cifar10',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/cifar10/',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, ))
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/coco_caption.py b/configs/_base_/datasets/coco_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..5346111273d4120581fe854583c99f6b94e7e873
--- /dev/null
+++ b/configs/_base_/datasets/coco_caption.py
@@ -0,0 +1,70 @@
+# data settings
+# coco caption annotations can be grabbed from LAVIS repo
+# https://github.com/salesforce/LAVIS/blob/main/lavis/configs/datasets/coco/defaults_cap.yaml
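+# A possible layout after downloading, following the data_root/ann_file settings
+# below (paths are an assumption, adjust them to your setup):
+#   data/coco/annotations/coco_karpathy_train.json
+#   data/coco/annotations/coco_karpathy_val.json
+#   data/coco/annotations/coco_karpathy_val_gt.json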
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='CleanCaption', keys='gt_caption'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['gt_caption'],
+ meta_keys=['image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_train.json',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_val.json',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/coco_karpathy_val_gt.json',
+)
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/coco_okvqa.py b/configs/_base_/datasets/coco_okvqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..16f1577dbb5e5c7c14186f2523e94e0aeffc4b54
--- /dev/null
+++ b/configs/_base_/datasets/coco_okvqa.py
@@ -0,0 +1,75 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='train2014',
+ question_file=
+ 'annotations/okvqa_OpenEnded_mscoco_train2014_questions.json',
+ ann_file='annotations/okvqa_mscoco_train2014_annotations.json',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='val2014',
+ question_file=
+ 'annotations/okvqa_OpenEnded_mscoco_val2014_questions.json',
+ ann_file='annotations/okvqa_mscoco_val2014_annotations.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(type='VQAAcc')
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/coco_retrieval.py b/configs/_base_/datasets/coco_retrieval.py
new file mode 100644
index 0000000000000000000000000000000000000000..6f6b802a3854fd029c476d78296edbc9bffd4e75
--- /dev/null
+++ b/configs/_base_/datasets/coco_retrieval.py
@@ -0,0 +1,99 @@
+# data settings
+# Here are the links to download the annotations of COCO retrieval for convenience  # noqa
+# https://download.openmmlab.com/mmclassification/datasets/coco_retrieval/caption_karpathy_train2014.json
+# https://download.openmmlab.com/mmclassification/datasets/coco_retrieval/caption_karpathy_val2014.json
+# https://download.openmmlab.com/mmclassification/datasets/coco_retrieval/caption_karpathy_test2014.json
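+# With data_root='data/coco' as configured below, the downloaded annotations are
+# expected at (an assumption based on the ann_file settings in this config):
+#   data/coco/annotations/caption_karpathy_train2014.json
+#   data/coco/annotations/caption_karpathy_val2014.json
+#   data/coco/annotations/caption_karpathy_test2014.json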
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+rand_increasing_policies = [
+ dict(type='AutoContrast'),
+ dict(type='Equalize'),
+ dict(type='Rotate', magnitude_key='angle', magnitude_range=(0, 30)),
+ dict(
+ type='Brightness', magnitude_key='magnitude',
+ magnitude_range=(0, 0.0)),
+ dict(type='Sharpness', magnitude_key='magnitude', magnitude_range=(0, 0)),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ direction='horizontal'),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ direction='vertical'),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ crop_ratio_range=(0.5, 1.0),
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies=rand_increasing_policies,
+ num_policies=2,
+ magnitude_level=5),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'is_matched'],
+ meta_keys=['image_id']),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'gt_text_id', 'gt_image_id'],
+ meta_keys=['image_id']),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=16,
+ dataset=dict(
+ type='COCORetrieval',
+ data_root='data/coco',
+ ann_file='annotations/caption_karpathy_train2014.json',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=16,
+ dataset=dict(
+ type='COCORetrieval',
+ data_root='data/coco',
+ ann_file='annotations/caption_karpathy_val2014.json',
+ pipeline=test_pipeline,
+ # This is required for evaluation
+ test_mode=True,
+ ),
+ sampler=dict(type='SequentialSampler', subsample_type='sequential'),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(type='RetrievalRecall', topk=(1, 5, 10))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/coco_vg_vqa.py b/configs/_base_/datasets/coco_vg_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..7ba0eac46853c1a477e2c6b2bc3dcddbbf7e5423
--- /dev/null
+++ b/configs/_base_/datasets/coco_vg_vqa.py
@@ -0,0 +1,96 @@
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=(480, 480),
+ crop_ratio_range=(0.5, 1.0),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='simple_increasing', # slightly different from LAVIS
+ num_policies=2,
+ magnitude_level=5),
+ dict(type='CleanCaption', keys=['question', 'gt_answer']),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight']),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys=['question']),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question'],
+ meta_keys=['question_id']),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='ConcatDataset',
+ datasets=[
+ # VQAv2 train
+ dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='train2014',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_train2014_questions.json',
+ ann_file='annotations/v2_mscoco_train2014_annotations.json',
+ pipeline=train_pipeline,
+ ),
+ # VQAv2 val
+ dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='val2014',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_val2014_questions.json',
+ ann_file='annotations/v2_mscoco_val2014_annotations.json',
+ pipeline=train_pipeline,
+ ),
+ # Visual Genome
+ dict(
+ type='VisualGenomeQA',
+ data_root='visual_genome',
+ data_prefix='image',
+ ann_file='question_answers.json',
+ pipeline=train_pipeline,
+ )
+ ]),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='test2015',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_test2015_questions.json', # noqa: E501
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='ReportVQA', file_path='vqa_test.json')
diff --git a/configs/_base_/datasets/coco_vqa.py b/configs/_base_/datasets/coco_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..7fb16bd241b357a897b168ceff5450b6e7f2dc80
--- /dev/null
+++ b/configs/_base_/datasets/coco_vqa.py
@@ -0,0 +1,84 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='train2014',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_train2014_questions.json',
+ ann_file='annotations/v2_mscoco_train2014_annotations.json',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='val2014',
+ question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json',
+ ann_file='annotations/v2_mscoco_val2014_annotations.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VQAAcc')
+
+test_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='test2015',
+ question_file= # noqa: E251
+ 'annotations/v2_OpenEnded_mscoco_test2015_questions.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='ReportVQA', file_path='vqa_test.json')
diff --git a/configs/_base_/datasets/cub_bs8_384.py b/configs/_base_/datasets/cub_bs8_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..24b3a9ffd4df6987716f15a42cc2e3d02c436b90
--- /dev/null
+++ b/configs/_base_/datasets/cub_bs8_384.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'CUB'
+data_preprocessor = dict(
+ num_classes=200,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=510),
+ dict(type='RandomCrop', crop_size=384),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=510),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=8,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/CUB_200_2011',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/CUB_200_2011',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, ))
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/cub_bs8_448.py b/configs/_base_/datasets/cub_bs8_448.py
new file mode 100644
index 0000000000000000000000000000000000000000..c0bc7b7e1fbd308763c68e1b6302669c705e8f41
--- /dev/null
+++ b/configs/_base_/datasets/cub_bs8_448.py
@@ -0,0 +1,50 @@
+# dataset settings
+dataset_type = 'CUB'
+data_preprocessor = dict(
+ num_classes=200,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=600),
+ dict(type='RandomCrop', crop_size=448),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=600),
+ dict(type='CenterCrop', crop_size=448),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=8,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/CUB_200_2011',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/CUB_200_2011',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, ))
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/flickr30k_caption.py b/configs/_base_/datasets/flickr30k_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..a902b5291f1df0df719f570538385a1c75dfccfd
--- /dev/null
+++ b/configs/_base_/datasets/flickr30k_caption.py
@@ -0,0 +1,92 @@
+# data settings
+
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='CleanCaption', keys='gt_caption'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['gt_caption'],
+ meta_keys=['image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type='Flickr30kCaption',
+ data_root='data/flickr30k',
+ ann_file='annotations/dataset_flickr30k.json',
+ data_prefix='images',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type='Flickr30kCaption',
+ data_root='data/flickr30k',
+ ann_file='annotations/dataset_flickr30k.json',
+ data_prefix='images',
+ split='val',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+# refer to tools/dataset_converters/convert_flickr30k_ann.py
+val_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/flickr30k_val_gt.json',
+)
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type='Flickr30kCaption',
+ data_root='data/flickr30k',
+ ann_file='annotations/dataset_flickr30k.json',
+ data_prefix='images',
+ split='test',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+# refer to tools/dataset_converters/convert_flickr30k_ann.py
+test_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/flickr30k_test_gt.json',
+)
diff --git a/configs/_base_/datasets/flickr30k_retrieval.py b/configs/_base_/datasets/flickr30k_retrieval.py
new file mode 100644
index 0000000000000000000000000000000000000000..acbc645b92214599d77cd9f3ecc70e9b7235b8e5
--- /dev/null
+++ b/configs/_base_/datasets/flickr30k_retrieval.py
@@ -0,0 +1,112 @@
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+rand_increasing_policies = [
+ dict(type='AutoContrast'),
+ dict(type='Equalize'),
+ dict(type='Rotate', magnitude_key='angle', magnitude_range=(0, 30)),
+ dict(
+ type='Brightness', magnitude_key='magnitude',
+ magnitude_range=(0, 0.0)),
+ dict(type='Sharpness', magnitude_key='magnitude', magnitude_range=(0, 0)),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ direction='horizontal'),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ direction='vertical'),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ crop_ratio_range=(0.5, 1.0),
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies=rand_increasing_policies,
+ num_policies=2,
+ magnitude_level=5),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'is_matched'],
+ meta_keys=['image_id']),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'gt_text_id', 'gt_image_id'],
+ meta_keys=['image_id']),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=16,
+ dataset=dict(
+ type='Flickr30kRetrieval',
+ data_root='data/flickr30k',
+ ann_file='annotations/dataset_flickr30k.json',
+ data_prefix='images',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=16,
+ dataset=dict(
+ type='Flickr30kRetrieval',
+ data_root='data/flickr30k',
+ ann_file='annotations/dataset_flickr30k.json',
+ data_prefix='images',
+ split='val',
+ pipeline=test_pipeline,
+ test_mode=True, # This is required for evaluation
+ ),
+ sampler=dict(type='SequentialSampler', subsample_type='sequential'),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(type='RetrievalRecall', topk=(1, 5, 10))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = dict(
+ batch_size=64,
+ num_workers=16,
+ dataset=dict(
+ type='Flickr30kRetrieval',
+ data_root='data/flickr30k',
+ ann_file='annotations/dataset_flickr30k.json',
+ data_prefix='images',
+ split='test',
+ pipeline=test_pipeline,
+ test_mode=True, # This is required for evaluation
+ ),
+ sampler=dict(type='SequentialSampler', subsample_type='sequential'),
+ persistent_workers=True,
+)
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/gqa.py b/configs/_base_/datasets/gqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..872ab451f32dd9cff87890c943a5ed1dc7ecb517
--- /dev/null
+++ b/configs/_base_/datasets/gqa.py
@@ -0,0 +1,81 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='GQA',
+ data_root='data/gqa',
+ data_prefix='images',
+ ann_file='annotations/train_balanced_questions.json',
+        pipeline=train_pipeline),
+    sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='GQA',
+ data_root='data/gqa',
+ data_prefix='images',
+ ann_file='annotations/testdev_balanced_questions.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='GQAAcc')
+
+test_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='GQA',
+ data_root='data/gqa',
+ data_prefix='images',
+ ann_file='annotations/testdev_balanced_questions.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet21k_bs128.py b/configs/_base_/datasets/imagenet21k_bs128.py
new file mode 100644
index 0000000000000000000000000000000000000000..38bfd351bf8f49ae18d21492c6fc656a7b2ecc45
--- /dev/null
+++ b/configs/_base_/datasets/imagenet21k_bs128.py
@@ -0,0 +1,28 @@
+# dataset settings
+dataset_type = 'ImageNet21k'
+data_preprocessor = dict(
+ num_classes=21842,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet21k',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
diff --git a/configs/_base_/datasets/imagenet_bs128_mbv3.py b/configs/_base_/datasets/imagenet_bs128_mbv3.py
new file mode 100644
index 0000000000000000000000000000000000000000..d355f507bf8e2be5d9efc3cc777e9854196b9d64
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_mbv3.py
@@ -0,0 +1,66 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='AutoAugment',
+ policies='imagenet',
+ hparams=dict(pad_val=[round(x) for x in bgr_mean])),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.2,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs128_poolformer_medium_224.py b/configs/_base_/datasets/imagenet_bs128_poolformer_medium_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..be90a655674e22c3341c185c7be5532b1bef8cf1
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_poolformer_medium_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=236,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs128_poolformer_small_224.py b/configs/_base_/datasets/imagenet_bs128_poolformer_small_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9e0f071ade1feccf6a3f96ef7ad8f28c693e84c
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_poolformer_small_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs128_revvit_224.py b/configs/_base_/datasets/imagenet_bs128_revvit_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..fd87aaf033b08dd94b5a684eed759072ff6fd4e9
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_revvit_224.py
@@ -0,0 +1,83 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand', # should be 'pixel', but currently not supported
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs128_riformer_medium_384.py b/configs/_base_/datasets/imagenet_bs128_riformer_medium_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..151ded7895b378ba7e6bf5895fb11d903841b95d
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_riformer_medium_384.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=404,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs128_riformer_small_384.py b/configs/_base_/datasets/imagenet_bs128_riformer_small_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..ea9799ba9c41fcbaf049a54d9776750c860a598c
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_riformer_small_384.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=426,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs128_vig_224.py b/configs/_base_/datasets/imagenet_bs128_vig_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..abb0182a6ce53202bee905bcd3849b851852b4b4
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_vig_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs16_eva_196.py b/configs/_base_/datasets/imagenet_bs16_eva_196.py
new file mode 100644
index 0000000000000000000000000000000000000000..f668e1d6e56ab4c5e311af912fe4b560a3a12bfd
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs16_eva_196.py
@@ -0,0 +1,60 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=196,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=196,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=196),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs16_eva_336.py b/configs/_base_/datasets/imagenet_bs16_eva_336.py
new file mode 100644
index 0000000000000000000000000000000000000000..e2c770af0f58a4db5d0435807f3cc9b499d01295
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs16_eva_336.py
@@ -0,0 +1,60 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=336,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=336,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=336),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs16_eva_448.py b/configs/_base_/datasets/imagenet_bs16_eva_448.py
new file mode 100644
index 0000000000000000000000000000000000000000..b90bba14eefb3c7e0bac8234dd84461a7b420462
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs16_eva_448.py
@@ -0,0 +1,62 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=448,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=448,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=448),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ ann_file='meta/train.txt',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ ann_file='meta/val.txt',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs16_eva_560.py b/configs/_base_/datasets/imagenet_bs16_eva_560.py
new file mode 100644
index 0000000000000000000000000000000000000000..9e548cc2a8de33fcd8ec80a2652dabcb931519aa
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs16_eva_560.py
@@ -0,0 +1,60 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=560,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=560,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=560),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs16_pil_bicubic_384.py b/configs/_base_/datasets/imagenet_bs16_pil_bicubic_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..8507af4dd0219d8aa6449b6b3d9a1f8d39f1bfce
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs16_pil_bicubic_384.py
@@ -0,0 +1,53 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=384, backend='pillow', interpolation='bicubic'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs256_beitv2.py b/configs/_base_/datasets/imagenet_bs256_beitv2.py
new file mode 100644
index 0000000000000000000000000000000000000000..9d420326f2cf3e26f1478d684a03e39c51799534
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_beitv2.py
@@ -0,0 +1,47 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='TwoNormDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ second_mean=[127.5, 127.5, 127.5],
+ second_std=[127.5, 127.5, 127.5],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.4,
+ hue=0.),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandomResizedCropAndInterpolationWithTwoPic',
+ size=224,
+ second_size=224,
+ interpolation='bicubic',
+ second_interpolation='bicubic',
+ scale=(0.2, 1.0)),
+ dict(
+ type='BEiTMaskGenerator',
+ input_size=(14, 14),
+ num_masking_patches=75,
+ max_num_patches=75,
+ min_num_patches=16),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs256_davit_224.py b/configs/_base_/datasets/imagenet_bs256_davit_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..3ea0a8382d8feaae6f39808b6b1193684294f918
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_davit_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=236,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs256_itpn.py b/configs/_base_/datasets/imagenet_bs256_itpn.py
new file mode 100644
index 0000000000000000000000000000000000000000..0b51c47272a99c4257a8c98dfe0b2bb8652e54a4
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_itpn.py
@@ -0,0 +1,49 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='TwoNormDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # clip mean & std
+ second_mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ second_std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.4,
+ hue=0.),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandomResizedCropAndInterpolationWithTwoPic',
+ size=224,
+ second_size=224,
+ interpolation='bicubic',
+ second_interpolation='bicubic',
+ scale=(0.2, 1.0)),
+ dict(
+ type='BEiTMaskGenerator',
+ input_size=(14, 14),
+ num_masking_patches=75,
+ max_num_patches=75,
+ min_num_patches=16),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs256_levit_224.py b/configs/_base_/datasets/imagenet_bs256_levit_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..612db7d7f0777ba50c78c084be8db7ba57266942
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_levit_224.py
@@ -0,0 +1,80 @@
+dataset_type = 'ImageNet'
+
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=4,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=256,
+ num_workers=4,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs256_rsb_a12.py b/configs/_base_/datasets/imagenet_bs256_rsb_a12.py
new file mode 100644
index 0000000000000000000000000000000000000000..ab59d9e42fea20b316f306023c86c7b75acdb80f
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_rsb_a12.py
@@ -0,0 +1,72 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=236,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=256,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs256_rsb_a3.py b/configs/_base_/datasets/imagenet_bs256_rsb_a3.py
new file mode 100644
index 0000000000000000000000000000000000000000..02e34497d8ba68416cab4b08b8347a9781899a4f
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_rsb_a3.py
@@ -0,0 +1,72 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=6,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=236,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=256,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs256_simmim_192.py b/configs/_base_/datasets/imagenet_bs256_simmim_192.py
new file mode 100644
index 0000000000000000000000000000000000000000..45062e9c28bac95737e4783c80f353870343b6f2
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_simmim_192.py
@@ -0,0 +1,33 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=192, crop_ratio_range=(0.67, 1.0)),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='SimMIMMaskGenerator',
+ input_size=192,
+ mask_patch_size=32,
+ model_patch_size=4,
+ mask_ratio=0.6),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs256_swin_192.py b/configs/_base_/datasets/imagenet_bs256_swin_192.py
new file mode 100644
index 0000000000000000000000000000000000000000..11c2cb2a82ec320f18b21c89e2bd455a51912c24
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_swin_192.py
@@ -0,0 +1,81 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=192,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=219,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=192),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ collate_fn=dict(type='default_collate'),
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ collate_fn=dict(type='default_collate'),
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='val',
+ pipeline=test_pipeline),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs32.py b/configs/_base_/datasets/imagenet_bs32.py
new file mode 100644
index 0000000000000000000000000000000000000000..a069bb9c3317079e2d7cdec8c8573ad0c7d42470
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs32.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs32_byol.py b/configs/_base_/datasets/imagenet_bs32_byol.py
new file mode 100644
index 0000000000000000000000000000000000000000..a7235b3be6fbfb79bcdc7179aef0bcd906475a68
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs32_byol.py
@@ -0,0 +1,89 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+view_pipeline1 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=1.),
+ dict(type='Solarize', thr=128, prob=0.),
+]
+view_pipeline2 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.1),
+ dict(type='Solarize', thr=128, prob=0.2)
+]
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='MultiView',
+ num_views=[1, 1],
+ transforms=[view_pipeline1, view_pipeline2]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=4,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs32_mocov2.py b/configs/_base_/datasets/imagenet_bs32_mocov2.py
new file mode 100644
index 0000000000000000000000000000000000000000..dc60050dc748f3f28e0b68c83a1fd0910503039b
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs32_mocov2.py
@@ -0,0 +1,58 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+# The difference between mocov2 and mocov1 is the transforms in the pipeline
+view_pipeline = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.2, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.4,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.5),
+ dict(type='RandomFlip', prob=0.5),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='MultiView', num_views=2, transforms=[view_pipeline]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ drop_last=True,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs32_pil_bicubic.py b/configs/_base_/datasets/imagenet_bs32_pil_bicubic.py
new file mode 100644
index 0000000000000000000000000000000000000000..36880ff76abd2329199801f807ec3bb0469ec140
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs32_pil_bicubic.py
@@ -0,0 +1,60 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs32_pil_resize.py b/configs/_base_/datasets/imagenet_bs32_pil_resize.py
new file mode 100644
index 0000000000000000000000000000000000000000..f9afc5cb0ed9fa7941b17fdfdae792b54adc9608
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs32_pil_resize.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs32_simclr.py b/configs/_base_/datasets/imagenet_bs32_simclr.py
new file mode 100644
index 0000000000000000000000000000000000000000..8e487b00b164eb964cfb4159a6918eb55d2b404e
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs32_simclr.py
@@ -0,0 +1,52 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+view_pipeline = [
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.8,
+ contrast=0.8,
+ saturation=0.8,
+ hue=0.2)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.5),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='MultiView', num_views=2, transforms=[view_pipeline]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=4,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs512_mae.py b/configs/_base_/datasets/imagenet_bs512_mae.py
new file mode 100644
index 0000000000000000000000000000000000000000..03d350eb0024a872e53f7d95ab7f3f12c4e70a25
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs512_mae.py
@@ -0,0 +1,32 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.2, 1.0),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=512,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs512_mocov3.py b/configs/_base_/datasets/imagenet_bs512_mocov3.py
new file mode 100644
index 0000000000000000000000000000000000000000..1679f636e316a229744d8d79b8cda5c92e2b1450
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs512_mocov3.py
@@ -0,0 +1,90 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+view_pipeline1 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.2, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=1.),
+ dict(type='Solarize', thr=128, prob=0.),
+ dict(type='RandomFlip', prob=0.5),
+]
+view_pipeline2 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.2, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.1),
+ dict(type='Solarize', thr=128, prob=0.2),
+ dict(type='RandomFlip', prob=0.5),
+]
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='MultiView',
+ num_views=[1, 1],
+ transforms=[view_pipeline1, view_pipeline2]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=512,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs64.py b/configs/_base_/datasets/imagenet_bs64.py
new file mode 100644
index 0000000000000000000000000000000000000000..73e6d54bdde5523604dca93a8731765b4def92db
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_autoaug.py b/configs/_base_/datasets/imagenet_bs64_autoaug.py
new file mode 100644
index 0000000000000000000000000000000000000000..3160b8cf2afaa05cd49e09cabade7f4716bbd23d
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_autoaug.py
@@ -0,0 +1,59 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='AutoAugment',
+ policies='imagenet',
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_clip_224.py b/configs/_base_/datasets/imagenet_bs64_clip_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..c200601ba45e7a1f317803e7c6f8c0ba34355623
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_clip_224.py
@@ -0,0 +1,73 @@
+# dataset settings
+dataset_type = 'ImageNet'
+img_norm_cfg = dict(
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=True)
+image_size = 224
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ size=image_size,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
+ # dict(
+ # type='RandAugment',
+ # policies={{_base_.rand_increasing_policies}},
+ # num_policies=2,
+ # total_level=10,
+ # magnitude_level=9,
+ # magnitude_std=0.5,
+ # hparams=dict(
+ # pad_val=[round(x) for x in img_norm_cfg['mean'][::-1]],
+ # interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=img_norm_cfg['mean'][::-1],
+ fill_std=img_norm_cfg['std'][::-1]),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='ToTensor', keys=['gt_label']),
+ dict(type='Collect', keys=['img', 'gt_label'])
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ size=(image_size, -1),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=image_size),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='Collect', keys=['img'])
+]
+
+data = dict(
+ samples_per_gpu=64,
+ workers_per_gpu=8,
+ train=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ val=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ test=dict(
+ # use split='test' instead of 'val' for the standard test set
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline))
+
+evaluation = dict(interval=10, metric='accuracy')
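Unlike the other dataset files in this diff, the three `imagenet_bs64_clip_*` configs keep the legacy MMClassification 0.x layout (`img_norm_cfg`, `Normalize`/`ImageToTensor`/`Collect` transforms and a `data=dict(samples_per_gpu=...)` block) rather than the `data_preprocessor`/`train_dataloader` schema used everywhere else. If they were migrated later, the normalization would move into a `data_preprocessor`, roughly as sketched below; this is only an illustration of the mapping, not a change made by this PR.

```python
# Sketch: the CLIP normalization above expressed in the new-style schema
# used by the surrounding dataset configs (values unchanged).
data_preprocessor = dict(
    num_classes=1000,
    mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
    std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
    to_rgb=True,
)
```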
diff --git a/configs/_base_/datasets/imagenet_bs64_clip_384.py b/configs/_base_/datasets/imagenet_bs64_clip_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..a7caee678774a3baa1481163fe89fe35ee5e9b96
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_clip_384.py
@@ -0,0 +1,73 @@
+# dataset settings
+dataset_type = 'ImageNet'
+img_norm_cfg = dict(
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=True)
+image_size = 384
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ size=image_size,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
+ # dict(
+ # type='RandAugment',
+ # policies={{_base_.rand_increasing_policies}},
+ # num_policies=2,
+ # total_level=10,
+ # magnitude_level=9,
+ # magnitude_std=0.5,
+ # hparams=dict(
+ # pad_val=[round(x) for x in img_norm_cfg['mean'][::-1]],
+ # interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=img_norm_cfg['mean'][::-1],
+ fill_std=img_norm_cfg['std'][::-1]),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='ToTensor', keys=['gt_label']),
+ dict(type='Collect', keys=['img', 'gt_label'])
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ size=(image_size, -1),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=image_size),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='Collect', keys=['img'])
+]
+
+data = dict(
+ samples_per_gpu=64,
+ workers_per_gpu=8,
+ train=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ val=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ test=dict(
+ # use split='test' instead of 'val' for the standard test set
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline))
+
+evaluation = dict(interval=10, metric='accuracy')
diff --git a/configs/_base_/datasets/imagenet_bs64_clip_448.py b/configs/_base_/datasets/imagenet_bs64_clip_448.py
new file mode 100644
index 0000000000000000000000000000000000000000..32a92ef66a30d6caff7d399fb321ec9283965920
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_clip_448.py
@@ -0,0 +1,74 @@
+# dataset settings
+dataset_type = 'ImageNet'
+img_norm_cfg = dict(
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=True)
+image_size = 448
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ size=image_size,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
+ # dict(
+ # type='RandAugment',
+ # policies={{_base_.rand_increasing_policies}},
+ # num_policies=2,
+ # total_level=10,
+ # magnitude_level=9,
+ # magnitude_std=0.5,
+ # hparams=dict(
+ # pad_val=[round(x) for x in img_norm_cfg['mean'][::-1]],
+ # interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=img_norm_cfg['mean'][::-1],
+ fill_std=img_norm_cfg['std'][::-1]),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='ToTensor', keys=['gt_label']),
+ dict(type='Collect', keys=['img', 'gt_label'])
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ size=(image_size, -1),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=image_size),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='Collect', keys=['img'])
+]
+
+data = dict(
+ samples_per_gpu=64,
+ workers_per_gpu=8,
+ train=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ val=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ test=dict(
+ # use split='test' instead of 'val' for the standard test set
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline))
+
+evaluation = dict(interval=10, metric='accuracy')
diff --git a/configs/_base_/datasets/imagenet_bs64_convmixer_224.py b/configs/_base_/datasets/imagenet_bs64_convmixer_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..7e9c0aa0f9bfc8883f3ee5d58464c8ea97f5e3bc
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_convmixer_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs')
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=233,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_deit3_224.py b/configs/_base_/datasets/imagenet_bs64_deit3_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..5e460a4d95a21d2ca3c3d6bb0d65e5c5409c14ff
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_deit3_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_deit3_384.py b/configs/_base_/datasets/imagenet_bs64_deit3_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..bc554ddba1d6a32a83638e7c2d58d27c345a4909
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_deit3_384.py
@@ -0,0 +1,60 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=384,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_edgenext_256.py b/configs/_base_/datasets/imagenet_bs64_edgenext_256.py
new file mode 100644
index 0000000000000000000000000000000000000000..7db9e4ef5f26691e364d244df0729827bf356293
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_edgenext_256.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=256,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=292,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_hivit_224.py b/configs/_base_/datasets/imagenet_bs64_hivit_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..4c258d7ab50ac74c3b2bb30a852f8f38a0f10b83
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_hivit_224.py
@@ -0,0 +1,83 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/val.txt',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_mixer_224.py b/configs/_base_/datasets/imagenet_bs64_mixer_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..b92a5141b5d3c0784216c83effb7b171c631fccc
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_mixer_224.py
@@ -0,0 +1,52 @@
+# dataset settings
+dataset_type = 'ImageNet'
+
+# Google Research usually uses the normalization setting below.
+data_preprocessor = dict(
+ num_classes=1000,
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_pil_resize.py b/configs/_base_/datasets/imagenet_bs64_pil_resize.py
new file mode 100644
index 0000000000000000000000000000000000000000..79f9325b022ac8b9219134a3b1ef47b584fcf3b2
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_pil_resize.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_pil_resize_autoaug.py b/configs/_base_/datasets/imagenet_bs64_pil_resize_autoaug.py
new file mode 100644
index 0000000000000000000000000000000000000000..c25906716c651d63440e1adeed66303ad7dae233
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_pil_resize_autoaug.py
@@ -0,0 +1,68 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='AutoAugment',
+ policies='imagenet',
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_swin_224.py b/configs/_base_/datasets/imagenet_bs64_swin_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..6e8786eb0feb5cade66d01b6ce99b4240e11918b
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_swin_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_swin_256.py b/configs/_base_/datasets/imagenet_bs64_swin_256.py
new file mode 100644
index 0000000000000000000000000000000000000000..9ecb41ba4d69c25ddc70469de440a0fde681fbc7
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_swin_256.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=256,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=292, # ( 256 / 224 * 256 )
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_swin_384.py b/configs/_base_/datasets/imagenet_bs64_swin_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..11264f808c1d154c80f5609fbe25e1e7e69a5c88
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_swin_384.py
@@ -0,0 +1,54 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=384, backend='pillow', interpolation='bicubic'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_t2t_224.py b/configs/_base_/datasets/imagenet_bs64_t2t_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..8a2dc10f85647fd20afd26d07a2c87a3e3a36962
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_t2t_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs8_pil_bicubic_320.py b/configs/_base_/datasets/imagenet_bs8_pil_bicubic_320.py
new file mode 100644
index 0000000000000000000000000000000000000000..7160084e56b44205d92a8266fc78ff51bf2a7b4c
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs8_pil_bicubic_320.py
@@ -0,0 +1,59 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[122.5, 122.5, 122.5],
+ std=[122.5, 122.5, 122.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=320,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=int(320 / 224 * 256),
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=320),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=8,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/inshop_bs32_448.py b/configs/_base_/datasets/inshop_bs32_448.py
new file mode 100644
index 0000000000000000000000000000000000000000..f9772fa665d4a5a3abae575a8fc61fb9f360cd0e
--- /dev/null
+++ b/configs/_base_/datasets/inshop_bs32_448.py
@@ -0,0 +1,64 @@
+# dataset settings
+dataset_type = 'InShop'
+data_preprocessor = dict(
+ num_classes=3997,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=512),
+ dict(type='RandomCrop', crop_size=448),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=512),
+ dict(type='CenterCrop', crop_size=448),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=4,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/inshop',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+query_dataloader = dict(
+ batch_size=32,
+ num_workers=4,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/inshop',
+ split='query',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+
+gallery_dataloader = dict(
+ batch_size=32,
+ num_workers=4,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/inshop',
+ split='gallery',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_dataloader = query_dataloader
+val_evaluator = [
+ dict(type='RetrievalRecall', topk=1),
+ dict(type='RetrievalAveragePrecision', topk=10),
+]
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
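Unlike the classification configs, this retrieval config exposes separate `query_dataloader` and `gallery_dataloader`; how they are consumed is decided by the model config that inherits this file. The sketch below shows one plausible wiring based on MMPreTrain's image-retrieval support; the class and loop names (`ImageToImageRetriever`, `RetrievalValLoop`, `RetrievalTestLoop`) are assumptions to verify against the repository, not something introduced by this diff.

```python
# Hypothetical downstream retrieval config (not part of this diff).
_base_ = ['../_base_/datasets/inshop_bs32_448.py']  # the file added above

# Assumption: the gallery loader serves as the retrieval prototype, while
# validation/testing use retrieval loops that embed the gallery before
# scoring the query set.
model = dict(
    type='ImageToImageRetriever',             # assumed retriever class
    prototype={{_base_.gallery_dataloader}},  # MMEngine base-variable syntax
)
val_cfg = dict(type='RetrievalValLoop')       # assumed loop names
test_cfg = dict(type='RetrievalTestLoop')
```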
diff --git a/configs/_base_/datasets/nlvr2.py b/configs/_base_/datasets/nlvr2.py
new file mode 100644
index 0000000000000000000000000000000000000000..2f5314bcd14d9e4f79898411e9c687470e31ac02
--- /dev/null
+++ b/configs/_base_/datasets/nlvr2.py
@@ -0,0 +1,86 @@
+# dataset settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(
+ type='ApplyToList',
+ # NLVR2 requires loading two images per sample.
+ scatter_key='img_path',
+ transforms=[
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ ],
+ collate_keys=['img', 'scale_factor', 'ori_shape'],
+ ),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(
+ type='ApplyToList',
+ # NLVR2 requires loading two images per sample.
+ scatter_key='img_path',
+ transforms=[
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ ],
+ collate_keys=['img', 'scale_factor', 'ori_shape'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='NLVR2',
+ data_root='data/nlvr2',
+ ann_file='dev.json',
+ data_prefix='dev',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=8,
+ dataset=dict(
+ type='NLVR2',
+ data_root='data/nlvr2',
+ ann_file='dev.json',
+ data_prefix='dev',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='Accuracy')
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/nocaps.py b/configs/_base_/datasets/nocaps.py
new file mode 100644
index 0000000000000000000000000000000000000000..5176671f2b9335b12127c7b58b2626eec12476ea
--- /dev/null
+++ b/configs/_base_/datasets/nocaps.py
@@ -0,0 +1,41 @@
+# data settings
+
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type='NoCaps',
+ data_root='data/nocaps/',
+ data_prefix=dict(img_path='images/'),
+ ann_file='annotations/nocaps_val_4500_captions.json',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(
+ type='NocapsSave',
+ save_dir='./',
+)
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/ocrvqa.py b/configs/_base_/datasets/ocrvqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..09e6e3536141f8ea901d2e5bb3070c23d816e8bc
--- /dev/null
+++ b/configs/_base_/datasets/ocrvqa.py
@@ -0,0 +1,81 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys=['question', 'gt_answer']),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=[],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys=['question', 'gt_answer']),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=[],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='OCRVQA',
+ data_root='data/ocrvqa',
+ data_prefix='images',
+ ann_file='annotations/dataset.json',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=8,
+ dataset=dict(
+ type='OCRVQA',
+ data_root='data/ocrvqa',
+ data_prefix='images',
+ ann_file='annotations/dataset.json',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VQAAcc')
+
+test_dataloader = dict(
+ batch_size=64,
+ num_workers=8,
+ dataset=dict(
+ type='OCRVQA',
+ data_root='data/ocrvqa',
+ data_prefix='images',
+ ann_file='annotations/dataset.json',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='VQAAcc')
diff --git a/configs/_base_/datasets/pipelines/auto_aug.py b/configs/_base_/datasets/pipelines/auto_aug.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a10f7eec61ea40336698118342939470f73d052
--- /dev/null
+++ b/configs/_base_/datasets/pipelines/auto_aug.py
@@ -0,0 +1,96 @@
+# Policy for ImageNet, referring to
+# https://github.com/DeepVoltaire/AutoAugment/blame/master/autoaugment.py
+policy_imagenet = [
+ [
+ dict(type='Posterize', bits=4, prob=0.4),
+ dict(type='Rotate', angle=30., prob=0.6)
+ ],
+ [
+ dict(type='Solarize', thr=256 / 9 * 4, prob=0.6),
+ dict(type='AutoContrast', prob=0.6)
+ ],
+ [dict(type='Equalize', prob=0.8),
+ dict(type='Equalize', prob=0.6)],
+ [
+ dict(type='Posterize', bits=5, prob=0.6),
+ dict(type='Posterize', bits=5, prob=0.6)
+ ],
+ [
+ dict(type='Equalize', prob=0.4),
+ dict(type='Solarize', thr=256 / 9 * 5, prob=0.2)
+ ],
+ [
+ dict(type='Equalize', prob=0.4),
+ dict(type='Rotate', angle=30 / 9 * 8, prob=0.8)
+ ],
+ [
+ dict(type='Solarize', thr=256 / 9 * 6, prob=0.6),
+ dict(type='Equalize', prob=0.6)
+ ],
+ [dict(type='Posterize', bits=6, prob=0.8),
+ dict(type='Equalize', prob=1.)],
+ [
+ dict(type='Rotate', angle=10., prob=0.2),
+ dict(type='Solarize', thr=256 / 9, prob=0.6)
+ ],
+ [
+ dict(type='Equalize', prob=0.6),
+ dict(type='Posterize', bits=5, prob=0.4)
+ ],
+ [
+ dict(type='Rotate', angle=30 / 9 * 8, prob=0.8),
+ dict(type='ColorTransform', magnitude=0., prob=0.4)
+ ],
+ [
+ dict(type='Rotate', angle=30., prob=0.4),
+ dict(type='Equalize', prob=0.6)
+ ],
+ [dict(type='Equalize', prob=0.0),
+ dict(type='Equalize', prob=0.8)],
+ [dict(type='Invert', prob=0.6),
+ dict(type='Equalize', prob=1.)],
+ [
+ dict(type='ColorTransform', magnitude=0.4, prob=0.6),
+ dict(type='Contrast', magnitude=0.8, prob=1.)
+ ],
+ [
+ dict(type='Rotate', angle=30 / 9 * 8, prob=0.8),
+ dict(type='ColorTransform', magnitude=0.2, prob=1.)
+ ],
+ [
+ dict(type='ColorTransform', magnitude=0.8, prob=0.8),
+ dict(type='Solarize', thr=256 / 9 * 2, prob=0.8)
+ ],
+ [
+ dict(type='Sharpness', magnitude=0.7, prob=0.4),
+ dict(type='Invert', prob=0.6)
+ ],
+ [
+ dict(
+ type='Shear',
+ magnitude=0.3 / 9 * 5,
+ prob=0.6,
+ direction='horizontal'),
+ dict(type='Equalize', prob=1.)
+ ],
+ [
+ dict(type='ColorTransform', magnitude=0., prob=0.4),
+ dict(type='Equalize', prob=0.6)
+ ],
+ [
+ dict(type='Equalize', prob=0.4),
+ dict(type='Solarize', thr=256 / 9 * 5, prob=0.2)
+ ],
+ [
+ dict(type='Solarize', thr=256 / 9 * 4, prob=0.6),
+ dict(type='AutoContrast', prob=0.6)
+ ],
+ [dict(type='Invert', prob=0.6),
+ dict(type='Equalize', prob=1.)],
+ [
+ dict(type='ColorTransform', magnitude=0.4, prob=0.6),
+ dict(type='Contrast', magnitude=0.8, prob=1.)
+ ],
+ [dict(type='Equalize', prob=0.8),
+ dict(type='Equalize', prob=0.6)],
+]
diff --git a/configs/_base_/datasets/pipelines/rand_aug.py b/configs/_base_/datasets/pipelines/rand_aug.py
new file mode 100644
index 0000000000000000000000000000000000000000..f2bab3c364f0d0223f2c972673da3abb6ac21bc6
--- /dev/null
+++ b/configs/_base_/datasets/pipelines/rand_aug.py
@@ -0,0 +1,43 @@
+# Refers to `_RAND_INCREASING_TRANSFORMS` in pytorch-image-models
+rand_increasing_policies = [
+ dict(type='AutoContrast'),
+ dict(type='Equalize'),
+ dict(type='Invert'),
+ dict(type='Rotate', magnitude_key='angle', magnitude_range=(0, 30)),
+ dict(type='Posterize', magnitude_key='bits', magnitude_range=(4, 0)),
+ dict(type='Solarize', magnitude_key='thr', magnitude_range=(256, 0)),
+ dict(
+ type='SolarizeAdd',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 110)),
+ dict(
+ type='ColorTransform',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.9)),
+ dict(type='Contrast', magnitude_key='magnitude', magnitude_range=(0, 0.9)),
+ dict(
+ type='Brightness', magnitude_key='magnitude',
+ magnitude_range=(0, 0.9)),
+ dict(
+ type='Sharpness', magnitude_key='magnitude', magnitude_range=(0, 0.9)),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ direction='horizontal'),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ direction='vertical'),
+ dict(
+ type='Translate',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.45),
+ direction='horizontal'),
+ dict(
+ type='Translate',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.45),
+ direction='vertical')
+]
diff --git a/configs/_base_/datasets/refcoco.py b/configs/_base_/datasets/refcoco.py
new file mode 100644
index 0000000000000000000000000000000000000000..f698e76c032fb22cc739450cc1e81e3174fd2b2f
--- /dev/null
+++ b/configs/_base_/datasets/refcoco.py
@@ -0,0 +1,105 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.4,
+ hue=0.1,
+ backend='cv2')
+ ],
+ prob=0.5),
+ dict(
+ type='mmdet.RandomCrop',
+ crop_type='relative_range',
+ crop_size=(0.8, 0.8),
+ allow_negative_crop=False),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(384, 384), (360, 360), (344, 344), (312, 312), (300, 300),
+ (286, 286), (270, 270)],
+ keep_ratio=False),
+ dict(
+ type='RandomTranslatePad',
+ size=384,
+ aug_translate=True,
+ ),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'gt_bboxes', 'scale_factor'],
+ meta_keys=['image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'gt_bboxes', 'scale_factor'],
+ meta_keys=['image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='RefCOCO',
+ data_root='data/coco',
+ data_prefix='train2014',
+ ann_file='refcoco/instances.json',
+ split_file='refcoco/refs(unc).p',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='RefCOCO',
+ data_root='data/coco',
+ data_prefix='train2014',
+ ann_file='refcoco/instances.json',
+ split_file='refcoco/refs(unc).p',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+
+val_evaluator = dict(type='VisualGroundingMetric')
+
+test_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='RefCOCO',
+ data_root='data/coco',
+ data_prefix='train2014',
+ ann_file='refcoco/instances.json',
+ split_file='refcoco/refs(unc).p',
+ split='testA', # or 'testB'
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/tiny_imagenet_bs32.py b/configs/_base_/datasets/tiny_imagenet_bs32.py
new file mode 100644
index 0000000000000000000000000000000000000000..6701413de0f7a4b65044dbf513a4267b9092500e
--- /dev/null
+++ b/configs/_base_/datasets/tiny_imagenet_bs32.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'CustomDataset'
+data_preprocessor = dict(
+ num_classes=200,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/tiny_imagenet_bs32_pil_resize.py b/configs/_base_/datasets/tiny_imagenet_bs32_pil_resize.py
new file mode 100644
index 0000000000000000000000000000000000000000..66250a49aaa549c00623c8549c4eafcae71a9254
--- /dev/null
+++ b/configs/_base_/datasets/tiny_imagenet_bs32_pil_resize.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'CustomDataset'
+data_preprocessor = dict(
+ num_classes=200,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/tiny_imagenet_bs64_pil_resize_autoaug.py b/configs/_base_/datasets/tiny_imagenet_bs64_pil_resize_autoaug.py
new file mode 100644
index 0000000000000000000000000000000000000000..0c41d7f1eed186f254150acaa6d9290b27478936
--- /dev/null
+++ b/configs/_base_/datasets/tiny_imagenet_bs64_pil_resize_autoaug.py
@@ -0,0 +1,68 @@
+# dataset settings
+dataset_type = 'CustomDataset'
+data_preprocessor = dict(
+ num_classes=200,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='AutoAugment',
+ policies='imagenet',
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/tiny_imagenet_bs64_swin_224.py b/configs/_base_/datasets/tiny_imagenet_bs64_swin_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..bddb78bf25273d6d244368d411c4e8fb9235b871
--- /dev/null
+++ b/configs/_base_/datasets/tiny_imagenet_bs64_swin_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'CustomDataset'
+data_preprocessor = dict(
+ num_classes=200,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/vizwiz.py b/configs/_base_/datasets/vizwiz.py
new file mode 100644
index 0000000000000000000000000000000000000000..bb7156c07030e9c031c8796c62267b7c4a8b2d7a
--- /dev/null
+++ b/configs/_base_/datasets/vizwiz.py
@@ -0,0 +1,80 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='VizWiz',
+ data_root='data/vizwiz/Images',
+ data_prefix='',
+ ann_file='Annotations/train.json',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='VizWiz',
+ data_root='data/vizwiz/Images',
+ data_prefix='',
+ ann_file='Annotations/val.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VizWizAcc')
+
+test_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='VizWiz',
+ data_root='data/vizwiz/Images',
+ data_prefix='',
+ ann_file='Annotations/test.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
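+# The VizWiz test split does not release answers; ReportVQA dumps the
+# predictions to a file for submission to the evaluation server.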
+test_evaluator = dict(type='ReportVQA', file_path='vqa_test.json')
diff --git a/configs/_base_/datasets/voc_bs16.py b/configs/_base_/datasets/voc_bs16.py
new file mode 100644
index 0000000000000000000000000000000000000000..cac2248cb6f0fc96a1e1407e06bba5fbc9e70a4b
--- /dev/null
+++ b/configs/_base_/datasets/voc_bs16.py
@@ -0,0 +1,65 @@
+# dataset settings
+dataset_type = 'VOC'
+data_preprocessor = dict(
+ num_classes=20,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+ # generate onehot-format labels for multi-label classification.
+ to_onehot=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(
+ type='PackInputs',
+ # `gt_label_difficult` is needed for VOC evaluation
+ meta_keys=('sample_idx', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction',
+ 'gt_label_difficult')),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/VOC2007',
+ split='trainval',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/VOC2007',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+
+# calculate precision_recall_f1 and mAP
+val_evaluator = [
+ dict(type='VOCMultiLabelMetric'),
+ dict(type='VOCMultiLabelMetric', average='micro'),
+ dict(type='VOCAveragePrecision')
+]
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/vsr.py b/configs/_base_/datasets/vsr.py
new file mode 100644
index 0000000000000000000000000000000000000000..0fa9b8992d0c453797b38add80dd6c92fbfa9227
--- /dev/null
+++ b/configs/_base_/datasets/vsr.py
@@ -0,0 +1,81 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='VSR',
+ data_root='data/coco',
+ data_prefix='',
+ ann_file='annotations/train.json',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='VSR',
+ data_root='data/coco',
+ data_prefix='',
+ ann_file='annotations/val.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VSRAcc')
+
+test_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='VSR',
+ data_root='data/coco',
+ data_prefix='',
+ ann_file='annotations/test.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+test_evaluator = val_evaluator
diff --git a/configs/_base_/default_runtime.py b/configs/_base_/default_runtime.py
new file mode 100644
index 0000000000000000000000000000000000000000..3816d423fabab10d26b0abfea1f60eb270c1dc83
--- /dev/null
+++ b/configs/_base_/default_runtime.py
@@ -0,0 +1,51 @@
+# use the registries in mmpretrain by default
+default_scope = 'mmpretrain'
+
+# configure default hooks
+default_hooks = dict(
+ # record the time of every iteration.
+ timer=dict(type='IterTimerHook'),
+
+ # print log every 100 iterations.
+ logger=dict(type='LoggerHook', interval=100),
+
+ # enable the parameter scheduler.
+ param_scheduler=dict(type='ParamSchedulerHook'),
+
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1),
+
+ # set the sampler seed in distributed environments.
+ sampler_seed=dict(type='DistSamplerSeedHook'),
+
+ # visualization of validation results, set enable=True to turn it on.
+ visualization=dict(type='VisualizationHook', enable=False),
+)
+
+# configure environment
+env_cfg = dict(
+ # whether to enable cudnn benchmark
+ cudnn_benchmark=False,
+
+ # set multi-process parameters
+ mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
+
+ # set distributed parameters
+ dist_cfg=dict(backend='nccl'),
+)
+
+# set visualizer
+vis_backends = [dict(type='LocalVisBackend')]
+visualizer = dict(type='UniversalVisualizer', vis_backends=vis_backends)
+
+# set log level
+log_level = 'INFO'
+
+# which checkpoint to load from (None means not to load any)
+load_from = None
+
+# whether to resume training from the loaded checkpoint
+resume = False
+
+# use a random seed and disable `deterministic` by default
+randomness = dict(seed=None, deterministic=False)
diff --git a/configs/_base_/models/conformer/base-p16.py b/configs/_base_/models/conformer/base-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..959da5059a8f36c1076bf9875c51fd466fc96fa4
--- /dev/null
+++ b/configs/_base_/models/conformer/base-p16.py
@@ -0,0 +1,23 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Conformer', arch='base', drop_path_rate=0.1, init_cfg=None),
+ neck=None,
+ head=dict(
+ type='ConformerHead',
+ num_classes=1000,
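+ # the Conformer backbone has a convolution branch and a transformer branch,
+ # so the head takes two input channel numbers, one for each branch.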
+ in_channels=[1536, 576],
+ init_cfg=None,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/conformer/small-p16.py b/configs/_base_/models/conformer/small-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..2e4f9f80745af51538306bd8928082f3fd2e9997
--- /dev/null
+++ b/configs/_base_/models/conformer/small-p16.py
@@ -0,0 +1,23 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Conformer', arch='small', drop_path_rate=0.1, init_cfg=None),
+ neck=None,
+ head=dict(
+ type='ConformerHead',
+ num_classes=1000,
+ in_channels=[1024, 384],
+ init_cfg=None,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/conformer/small-p32.py b/configs/_base_/models/conformer/small-p32.py
new file mode 100644
index 0000000000000000000000000000000000000000..f73811fff492f3e1770e514335ccc71b2bd3caf6
--- /dev/null
+++ b/configs/_base_/models/conformer/small-p32.py
@@ -0,0 +1,27 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Conformer',
+ arch='small',
+ patch_size=32,
+ drop_path_rate=0.1,
+ init_cfg=None),
+ neck=None,
+ head=dict(
+ type='ConformerHead',
+ num_classes=1000,
+ in_channels=[1024, 384],
+ init_cfg=None,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/conformer/tiny-p16.py b/configs/_base_/models/conformer/tiny-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..fa9753b6fac957a0c8f9612bd0b9a693a3ecbf4e
--- /dev/null
+++ b/configs/_base_/models/conformer/tiny-p16.py
@@ -0,0 +1,23 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Conformer', arch='tiny', drop_path_rate=0.1, init_cfg=None),
+ neck=None,
+ head=dict(
+ type='ConformerHead',
+ num_classes=1000,
+ in_channels=[256, 384],
+ init_cfg=None,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/convmixer/convmixer-1024-20.py b/configs/_base_/models/convmixer/convmixer-1024-20.py
new file mode 100644
index 0000000000000000000000000000000000000000..a8f4d517e0d5e74c0d0412bb6e4f43b244761c03
--- /dev/null
+++ b/configs/_base_/models/convmixer/convmixer-1024-20.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvMixer', arch='1024/20'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/convmixer/convmixer-1536-20.py b/configs/_base_/models/convmixer/convmixer-1536-20.py
new file mode 100644
index 0000000000000000000000000000000000000000..9ad8209bb4fc55665be36cdcd8102d854c533951
--- /dev/null
+++ b/configs/_base_/models/convmixer/convmixer-1536-20.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvMixer', arch='1536/20'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/convmixer/convmixer-768-32.py b/configs/_base_/models/convmixer/convmixer-768-32.py
new file mode 100644
index 0000000000000000000000000000000000000000..1cba528b0edf9d394ae9730ecd51d41bbd314b38
--- /dev/null
+++ b/configs/_base_/models/convmixer/convmixer-768-32.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvMixer', arch='768/32', act_cfg=dict(type='ReLU')),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/convnext/convnext-base.py b/configs/_base_/models/convnext/convnext-base.py
new file mode 100644
index 0000000000000000000000000000000000000000..aba6c19d1ac5039bab2363f80d500c81d4bb809b
--- /dev/null
+++ b/configs/_base_/models/convnext/convnext-base.py
@@ -0,0 +1,19 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvNeXt', arch='base', drop_path_rate=0.5),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext/convnext-large.py b/configs/_base_/models/convnext/convnext-large.py
new file mode 100644
index 0000000000000000000000000000000000000000..9bd4d9f68bd47b207de129ab169c2366156199b3
--- /dev/null
+++ b/configs/_base_/models/convnext/convnext-large.py
@@ -0,0 +1,19 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvNeXt', arch='large', drop_path_rate=0.5),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext/convnext-small.py b/configs/_base_/models/convnext/convnext-small.py
new file mode 100644
index 0000000000000000000000000000000000000000..aeedb6d22fc8f80fe6c5fb246df44c8a28c41854
--- /dev/null
+++ b/configs/_base_/models/convnext/convnext-small.py
@@ -0,0 +1,19 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvNeXt', arch='small', drop_path_rate=0.4),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext/convnext-tiny.py b/configs/_base_/models/convnext/convnext-tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..05baba09eefe44196a54c112c5c785ff79a1b52b
--- /dev/null
+++ b/configs/_base_/models/convnext/convnext-tiny.py
@@ -0,0 +1,19 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvNeXt', arch='tiny', drop_path_rate=0.1),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext/convnext-xlarge.py b/configs/_base_/models/convnext/convnext-xlarge.py
new file mode 100644
index 0000000000000000000000000000000000000000..7211b94f6cebe4c93d150dec276291f725f9f513
--- /dev/null
+++ b/configs/_base_/models/convnext/convnext-xlarge.py
@@ -0,0 +1,19 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvNeXt', arch='xlarge', drop_path_rate=0.5),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext_v2/atto.py b/configs/_base_/models/convnext_v2/atto.py
new file mode 100644
index 0000000000000000000000000000000000000000..557ce93fce2572fe2fd95db80da4556e0dd7810d
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/atto.py
@@ -0,0 +1,20 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='atto',
+ drop_path_rate=0.1,
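+ # ConvNeXt V2 disables layer scale and uses Global Response Normalization.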
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=320,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.2),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+)
diff --git a/configs/_base_/models/convnext_v2/base.py b/configs/_base_/models/convnext_v2/base.py
new file mode 100644
index 0000000000000000000000000000000000000000..1401ef75f96814d5db1f6a37aa8d8761ccfe1e39
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/base.py
@@ -0,0 +1,24 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='base',
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext_v2/femto.py b/configs/_base_/models/convnext_v2/femto.py
new file mode 100644
index 0000000000000000000000000000000000000000..d56a241a97820713618480bec0fe09f94ecb1cea
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/femto.py
@@ -0,0 +1,20 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='femto',
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+)
diff --git a/configs/_base_/models/convnext_v2/huge.py b/configs/_base_/models/convnext_v2/huge.py
new file mode 100644
index 0000000000000000000000000000000000000000..54141dd5220fdd0f40ce21054890e86b19597aff
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/huge.py
@@ -0,0 +1,24 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='huge',
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2816,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext_v2/large.py b/configs/_base_/models/convnext_v2/large.py
new file mode 100644
index 0000000000000000000000000000000000000000..20237de2baaccd2779bcec45549ec5a294d8ba6b
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/large.py
@@ -0,0 +1,24 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='large',
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext_v2/nano.py b/configs/_base_/models/convnext_v2/nano.py
new file mode 100644
index 0000000000000000000000000000000000000000..05575d0e105da6880beafa08d1bdb0c608261a51
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/nano.py
@@ -0,0 +1,20 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='nano',
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=640,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.2),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+)
diff --git a/configs/_base_/models/convnext_v2/pico.py b/configs/_base_/models/convnext_v2/pico.py
new file mode 100644
index 0000000000000000000000000000000000000000..6d50ba890069457bc512ac2d2da1038ee73cd065
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/pico.py
@@ -0,0 +1,20 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='pico',
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+)
diff --git a/configs/_base_/models/convnext_v2/tiny.py b/configs/_base_/models/convnext_v2/tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9835ccdb47f8c976be9519160ba13f6f4a168f9
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/tiny.py
@@ -0,0 +1,24 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='tiny',
+ drop_path_rate=0.2,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.2),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/davit/davit-base.py b/configs/_base_/models/davit/davit-base.py
new file mode 100644
index 0000000000000000000000000000000000000000..0dbf07739ecc907e4a77d0cdbd9c21f4c8fbecf1
--- /dev/null
+++ b/configs/_base_/models/davit/davit-base.py
@@ -0,0 +1,16 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DaViT', arch='base', out_indices=(3, ), drop_path_rate=0.4),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/davit/davit-small.py b/configs/_base_/models/davit/davit-small.py
new file mode 100644
index 0000000000000000000000000000000000000000..2fa0325552c2bc28f69263ba42547090b7a521fb
--- /dev/null
+++ b/configs/_base_/models/davit/davit-small.py
@@ -0,0 +1,16 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DaViT', arch='small', out_indices=(3, ), drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/davit/davit-tiny.py b/configs/_base_/models/davit/davit-tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..29432d28bd09a613bf4eaabe4f8ef4d0d763a49d
--- /dev/null
+++ b/configs/_base_/models/davit/davit-tiny.py
@@ -0,0 +1,16 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DaViT', arch='t', out_indices=(3, ), drop_path_rate=0.1),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-base-p16-224.py b/configs/_base_/models/deit3/deit3-base-p16-224.py
new file mode 100644
index 0000000000000000000000000000000000000000..84cba1afadbf13ed78e5f3c2be112a70b5ba8be1
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-base-p16-224.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='b',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.2),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-base-p16-384.py b/configs/_base_/models/deit3/deit3-base-p16-384.py
new file mode 100644
index 0000000000000000000000000000000000000000..1c9f42bc3a3b69c5091c5a31c0d7a137fb944cf5
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-base-p16-384.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ drop_path_rate=0.15),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-huge-p14-224.py b/configs/_base_/models/deit3/deit3-huge-p14-224.py
new file mode 100644
index 0000000000000000000000000000000000000000..b7a69ce914fbc32b029cb1a891fb1cf49d4bfce0
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-huge-p14-224.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='h',
+ img_size=224,
+ patch_size=14,
+ drop_path_rate=0.55),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-large-p16-224.py b/configs/_base_/models/deit3/deit3-large-p16-224.py
new file mode 100644
index 0000000000000000000000000000000000000000..96135c57879715a1de50efd8e6c28fc635eae1ff
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-large-p16-224.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='l',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.45),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-large-p16-384.py b/configs/_base_/models/deit3/deit3-large-p16-384.py
new file mode 100644
index 0000000000000000000000000000000000000000..aa9326c17cd0b0e1d625270140a80f1bb92fc0bf
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-large-p16-384.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='l',
+ img_size=384,
+ patch_size=16,
+ drop_path_rate=0.4),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-medium-p16-224.py b/configs/_base_/models/deit3/deit3-medium-p16-224.py
new file mode 100644
index 0000000000000000000000000000000000000000..84233e5cfde13cd0f142b49f64c3b3ec65ff4f68
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-medium-p16-224.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='m',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.2),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-small-p16-224.py b/configs/_base_/models/deit3/deit3-small-p16-224.py
new file mode 100644
index 0000000000000000000000000000000000000000..af29d32bc799ebdff5a9724fe5555261ba0b584c
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-small-p16-224.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='s',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.05),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-small-p16-384.py b/configs/_base_/models/deit3/deit3-small-p16-384.py
new file mode 100644
index 0000000000000000000000000000000000000000..bebb4845e8c3a47e1d944702c49357d6d8aa4cd6
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-small-p16-384.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='s',
+ img_size=384,
+ patch_size=16,
+ drop_path_rate=0.0),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/densenet/densenet121.py b/configs/_base_/models/densenet/densenet121.py
new file mode 100644
index 0000000000000000000000000000000000000000..0a14d302584a910e87ccf598e9434bd0685207aa
--- /dev/null
+++ b/configs/_base_/models/densenet/densenet121.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='DenseNet', arch='121'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/densenet/densenet161.py b/configs/_base_/models/densenet/densenet161.py
new file mode 100644
index 0000000000000000000000000000000000000000..61a0d838806267a5c987fa30eeb6363f23387ef3
--- /dev/null
+++ b/configs/_base_/models/densenet/densenet161.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='DenseNet', arch='161'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2208,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/densenet/densenet169.py b/configs/_base_/models/densenet/densenet169.py
new file mode 100644
index 0000000000000000000000000000000000000000..779ea1709256f8c001adaa3c73155c36d3363d71
--- /dev/null
+++ b/configs/_base_/models/densenet/densenet169.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='DenseNet', arch='169'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1664,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/densenet/densenet201.py b/configs/_base_/models/densenet/densenet201.py
new file mode 100644
index 0000000000000000000000000000000000000000..2909af0d36c656c1868ff38e72981dc9dafeaa2f
--- /dev/null
+++ b/configs/_base_/models/densenet/densenet201.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='DenseNet', arch='201'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1920,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/edgenext/edgenext-base.py b/configs/_base_/models/edgenext/edgenext-base.py
new file mode 100644
index 0000000000000000000000000000000000000000..378397298ed9d51241ad737d65b05f151ac69393
--- /dev/null
+++ b/configs/_base_/models/edgenext/edgenext-base.py
@@ -0,0 +1,23 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='EdgeNeXt',
+ arch='base',
+ out_indices=(3, ),
+ drop_path_rate=0.1,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=584,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/edgenext/edgenext-small.py b/configs/_base_/models/edgenext/edgenext-small.py
new file mode 100644
index 0000000000000000000000000000000000000000..e1f7e1728a2f5cb895600aa0d81eeb5734dffec0
--- /dev/null
+++ b/configs/_base_/models/edgenext/edgenext-small.py
@@ -0,0 +1,23 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='EdgeNeXt',
+ arch='small',
+ out_indices=(3, ),
+ drop_path_rate=0.1,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=304,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/edgenext/edgenext-xsmall.py b/configs/_base_/models/edgenext/edgenext-xsmall.py
new file mode 100644
index 0000000000000000000000000000000000000000..69c7d0d6a6ec9d09df03c007cd3fffa93165f5cb
--- /dev/null
+++ b/configs/_base_/models/edgenext/edgenext-xsmall.py
@@ -0,0 +1,23 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='EdgeNeXt',
+ arch='xsmall',
+ out_indices=(3, ),
+ drop_path_rate=0.1,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/edgenext/edgenext-xxsmall.py b/configs/_base_/models/edgenext/edgenext-xxsmall.py
new file mode 100644
index 0000000000000000000000000000000000000000..fb6881951fae8c01c2a4ea78c3d61e7c6a900f24
--- /dev/null
+++ b/configs/_base_/models/edgenext/edgenext-xxsmall.py
@@ -0,0 +1,23 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='EdgeNeXt',
+ arch='xxsmall',
+ out_indices=(3, ),
+ drop_path_rate=0.1,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=168,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/efficientformer-l1.py b/configs/_base_/models/efficientformer-l1.py
new file mode 100644
index 0000000000000000000000000000000000000000..37dc62cd235ee5a3f0257a24c54c8eb4fc797159
--- /dev/null
+++ b/configs/_base_/models/efficientformer-l1.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='EfficientFormer',
+ arch='l1',
+ drop_path_rate=0,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-5)
+ ]),
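+ # the last EfficientFormer stage outputs 1D token features, so pool with dim=1.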
+ neck=dict(type='GlobalAveragePooling', dim=1),
+ head=dict(
+ type='EfficientFormerClsHead', in_channels=448, num_classes=1000))
diff --git a/configs/_base_/models/efficientnet_b0.py b/configs/_base_/models/efficientnet_b0.py
new file mode 100644
index 0000000000000000000000000000000000000000..d9ba685306c9e411a69887a2a301808cbaa104cb
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b0.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b0'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b1.py b/configs/_base_/models/efficientnet_b1.py
new file mode 100644
index 0000000000000000000000000000000000000000..63e15c88b2f7e1d1c788811741ff26bf5f35601f
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b1.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b1'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b2.py b/configs/_base_/models/efficientnet_b2.py
new file mode 100644
index 0000000000000000000000000000000000000000..5edcfa5d5b680ec41567e531e0b7a587e160c8af
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b2.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b2'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1408,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b3.py b/configs/_base_/models/efficientnet_b3.py
new file mode 100644
index 0000000000000000000000000000000000000000..c7c6d6d899ecb910a37cbd3818f8c79c27db87e9
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b3.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b3'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b4.py b/configs/_base_/models/efficientnet_b4.py
new file mode 100644
index 0000000000000000000000000000000000000000..06840ed559cc14ae47919f7cce67d635173e841d
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b4.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b4'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1792,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b5.py b/configs/_base_/models/efficientnet_b5.py
new file mode 100644
index 0000000000000000000000000000000000000000..a86eebd19042eb36534ef3f42cc16bb32e88fb66
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b5.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b5'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b6.py b/configs/_base_/models/efficientnet_b6.py
new file mode 100644
index 0000000000000000000000000000000000000000..4eada1d32511371bcb11c636b3aae9dc4733d379
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b6.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b6'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2304,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b7.py b/configs/_base_/models/efficientnet_b7.py
new file mode 100644
index 0000000000000000000000000000000000000000..1d84ba427f42a186f376d829189461536e7ee383
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b7.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b7'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2560,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b8.py b/configs/_base_/models/efficientnet_b8.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9500644dae4a3240c5ecfa02f90deb8fde4e3de
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b8.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b8'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2816,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_em.py b/configs/_base_/models/efficientnet_em.py
new file mode 100644
index 0000000000000000000000000000000000000000..abecdbeef6c3791f902b6bd13fbceb28c3ac8942
--- /dev/null
+++ b/configs/_base_/models/efficientnet_em.py
@@ -0,0 +1,13 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ # `em` means EfficientNet-EdgeTPU-M arch
+ backbone=dict(type='EfficientNet', arch='em', act_cfg=dict(type='ReLU')),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_es.py b/configs/_base_/models/efficientnet_es.py
new file mode 100644
index 0000000000000000000000000000000000000000..911ba4a18261decd3d17e8962501083e1f1ea550
--- /dev/null
+++ b/configs/_base_/models/efficientnet_es.py
@@ -0,0 +1,13 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ # `es` means EfficientNet-EdgeTPU-S arch
+ backbone=dict(type='EfficientNet', arch='es', act_cfg=dict(type='ReLU')),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_l2.py b/configs/_base_/models/efficientnet_l2.py
new file mode 100644
index 0000000000000000000000000000000000000000..4219c87a81a93c50296cfebed8f20b9bbd2a4c13
--- /dev/null
+++ b/configs/_base_/models/efficientnet_l2.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='l2'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=5504,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_b0.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_b0.py
new file mode 100644
index 0000000000000000000000000000000000000000..d42e32905ed9d18ab572bfe1e90c7161f941a34f
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_b0.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='b0'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_b1.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_b1.py
new file mode 100644
index 0000000000000000000000000000000000000000..10736fc504637b07fe362e27c5e86ea73990217a
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_b1.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='b1'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_b2.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_b2.py
new file mode 100644
index 0000000000000000000000000000000000000000..61f477120e031cd8cf46340bdbd3c687ade2a035
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_b2.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='b2'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1408,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_b3.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_b3.py
new file mode 100644
index 0000000000000000000000000000000000000000..14e523fd2e4180e960aa8a3282e56f6604c38a47
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_b3.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='b3'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_l.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_l.py
new file mode 100644
index 0000000000000000000000000000000000000000..456467d6fa076db11b009fca875e231569e05288
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_l.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='l'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_m.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_m.py
new file mode 100644
index 0000000000000000000000000000000000000000..8e4d303f624d3375416b7c41c59a68a1a64e4a19
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_m.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='m'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_s.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_s.py
new file mode 100644
index 0000000000000000000000000000000000000000..866648223c79aac1ca8519a1d18b167b7ac474ec
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_s.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='s'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_xl.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_xl.py
new file mode 100644
index 0000000000000000000000000000000000000000..2216c9daa7d5e5e11084320b3aeab6a388588f40
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_xl.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='xl'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/eva/eva-g.py b/configs/_base_/models/eva/eva-g.py
new file mode 100644
index 0000000000000000000000000000000000000000..17bc84ad8bd2ac5599f26351b5fb5ca3fb8ec8bc
--- /dev/null
+++ b/configs/_base_/models/eva/eva-g.py
@@ -0,0 +1,29 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
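+ # EVA is implemented on top of the BEiT ViT backbone; 'eva-g' is the giant arch.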
+ type='BEiTViT',
+ arch='eva-g',
+ img_size=224,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ out_type='avg_featmap',
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ ),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1408,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/eva/eva-l.py b/configs/_base_/models/eva/eva-l.py
new file mode 100644
index 0000000000000000000000000000000000000000..9b08e4b1e1881b706848c121ceb3b4d23cfae34a
--- /dev/null
+++ b/configs/_base_/models/eva/eva-l.py
@@ -0,0 +1,30 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='l',
+ img_size=224,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ out_type='avg_featmap',
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ layer_cfgs=dict(bias=True),
+ ),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hivit/base_224.py b/configs/_base_/models/hivit/base_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..a87a68cf6f03e3e794361324fe5158b6a7dc5faa
--- /dev/null
+++ b/configs/_base_/models/hivit/base_224.py
@@ -0,0 +1,28 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='HiViT',
+ arch='base',
+ img_size=224,
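+ # enable absolute (ape) and relative (rpe) position embeddings.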
+ ape=True,
+ rpe=True,
+ drop_path_rate=0.5),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/hivit/small_224.py b/configs/_base_/models/hivit/small_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..333b2461d3ef681dd24f367f18e38f2cc87dd2de
--- /dev/null
+++ b/configs/_base_/models/hivit/small_224.py
@@ -0,0 +1,28 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='HiViT',
+ arch='small',
+ img_size=224,
+ ape=True,
+ rpe=True,
+ drop_path_rate=0.3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/hivit/tiny_224.py b/configs/_base_/models/hivit/tiny_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..b3e2fdb3ce64aa8cfe42fb0b923d34fcdbb0524f
--- /dev/null
+++ b/configs/_base_/models/hivit/tiny_224.py
@@ -0,0 +1,28 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='HiViT',
+ arch='tiny',
+ img_size=224,
+ ape=True,
+ rpe=True,
+ drop_path_rate=0.05),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/hornet/hornet-base-gf.py b/configs/_base_/models/hornet/hornet-base-gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..b6924f96265cda310a38765fa460ad685d9d01b7
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-base-gf.py
@@ -0,0 +1,20 @@
+model = dict(
+ type='ImageClassifier',
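+ # the '-gf' archs use the Global Filter (FFT-based) variant of gnConv.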
+ backbone=dict(type='HorNet', arch='base-gf', drop_path_rate=0.5),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-base.py b/configs/_base_/models/hornet/hornet-base.py
new file mode 100644
index 0000000000000000000000000000000000000000..904379ab5f258fa366d75166e7446fccecf0bc2c
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-base.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='base', drop_path_rate=0.5),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-large-gf.py b/configs/_base_/models/hornet/hornet-large-gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..1607ba2208415699697f8ada17941cc75a6270a9
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-large-gf.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='large-gf', drop_path_rate=0.2),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-large-gf384.py b/configs/_base_/models/hornet/hornet-large-gf384.py
new file mode 100644
index 0000000000000000000000000000000000000000..fbb547873ed047adaed448fb1d443b4de8750ea4
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-large-gf384.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='large-gf384', drop_path_rate=0.4),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ])
diff --git a/configs/_base_/models/hornet/hornet-large.py b/configs/_base_/models/hornet/hornet-large.py
new file mode 100644
index 0000000000000000000000000000000000000000..b5494fd8985970c2a60424ab6b6e07cd8965a6ed
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-large.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='large', drop_path_rate=0.2),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-small-gf.py b/configs/_base_/models/hornet/hornet-small-gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..42e26d3a4bf75aab77a3fbdda2135bed98223476
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-small-gf.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='small-gf', drop_path_rate=0.4),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-small.py b/configs/_base_/models/hornet/hornet-small.py
new file mode 100644
index 0000000000000000000000000000000000000000..d59184d40ab2f8a5c03c82caeade85dcd32c9180
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-small.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='small', drop_path_rate=0.4),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-tiny-gf.py b/configs/_base_/models/hornet/hornet-tiny-gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..6b06f5b121f18f26c5a3a3442f3bbf8842bdd206
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-tiny-gf.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='tiny-gf', drop_path_rate=0.2),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-tiny.py b/configs/_base_/models/hornet/hornet-tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..aed710eb862467da4d39c13a4fad41e7e6b76f29
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-tiny.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='tiny', drop_path_rate=0.2),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hrnet/hrnet-w18.py b/configs/_base_/models/hrnet/hrnet-w18.py
new file mode 100644
index 0000000000000000000000000000000000000000..f7fbf298d5b64ba1cefa46a4a5d2823c2fa8cf17
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w18.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w18'),
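+    # HRFuseScales fuses the 4 resolution branches into one 2048-d feature.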
+ neck=[
+ dict(type='HRFuseScales', in_channels=(18, 36, 72, 144)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/hrnet/hrnet-w30.py b/configs/_base_/models/hrnet/hrnet-w30.py
new file mode 100644
index 0000000000000000000000000000000000000000..babcacac59af0ff92802a71f48b249b29a760acb
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w30.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w30'),
+ neck=[
+ dict(type='HRFuseScales', in_channels=(30, 60, 120, 240)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/hrnet/hrnet-w32.py b/configs/_base_/models/hrnet/hrnet-w32.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c1e980048d6bb855b94e0bb3027941d07513c05
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w32.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w32'),
+ neck=[
+ dict(type='HRFuseScales', in_channels=(32, 64, 128, 256)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/hrnet/hrnet-w40.py b/configs/_base_/models/hrnet/hrnet-w40.py
new file mode 100644
index 0000000000000000000000000000000000000000..83f65d864679297b25b39438d49eb491c92c33a1
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w40.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w40'),
+ neck=[
+ dict(type='HRFuseScales', in_channels=(40, 80, 160, 320)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/hrnet/hrnet-w44.py b/configs/_base_/models/hrnet/hrnet-w44.py
new file mode 100644
index 0000000000000000000000000000000000000000..e75dc0f891f6f9dd14ba31b865fd29afd622f4db
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w44.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w44'),
+ neck=[
+ dict(type='HRFuseScales', in_channels=(44, 88, 176, 352)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/hrnet/hrnet-w48.py b/configs/_base_/models/hrnet/hrnet-w48.py
new file mode 100644
index 0000000000000000000000000000000000000000..f0604958481ba2af277e3a0f9515dc1423def6c6
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w48.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w48'),
+ neck=[
+ dict(type='HRFuseScales', in_channels=(48, 96, 192, 384)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/hrnet/hrnet-w64.py b/configs/_base_/models/hrnet/hrnet-w64.py
new file mode 100644
index 0000000000000000000000000000000000000000..844c3fe9413f624dd374ceb1a9c3bbc185a20a3e
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w64.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w64'),
+ neck=[
+ dict(type='HRFuseScales', in_channels=(64, 128, 256, 512)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/inception_v3.py b/configs/_base_/models/inception_v3.py
new file mode 100644
index 0000000000000000000000000000000000000000..3f6a8305efe2ef87cfd0d2676056a07595831c6b
--- /dev/null
+++ b/configs/_base_/models/inception_v3.py
@@ -0,0 +1,10 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='InceptionV3', num_classes=1000, aux_logits=False),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)),
+)
diff --git a/configs/_base_/models/itpn_hivit-base-p16.py b/configs/_base_/models/itpn_hivit-base-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..834d6fe53b30b3370df0e5aaa08d6786472810a6
--- /dev/null
+++ b/configs/_base_/models/itpn_hivit-base-p16.py
@@ -0,0 +1,34 @@
+# model settings
+model = dict(
+ type='iTPN',
+ backbone=dict(
+ type='iTPNHiViT',
+ arch='base',
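+        # 'pixel': reconstruct raw pixel values (MAE-style pre-training target).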
+ reconstruction_type='pixel',
+ mask_ratio=0.75),
+ neck=dict(
+ type='iTPNPretrainDecoder',
+ num_patches=196,
+ patch_size=16,
+ in_chans=3,
+ embed_dim=512,
+ decoder_embed_dim=512,
+ decoder_depth=6,
+ decoder_num_heads=16,
+ mlp_ratio=4.,
+ reconstruction_type='pixel',
+ # transformer pyramid
+ fpn_dim=256,
+ fpn_depth=2,
+ num_outs=3,
+ ),
+ head=dict(
+ type='MAEPretrainHead',
+ norm_pix=True,
+ patch_size=16,
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')),
+ init_cfg=[
+ dict(type='Xavier', layer='Linear', distribution='uniform'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ])
diff --git a/configs/_base_/models/levit-256-p16.py b/configs/_base_/models/levit-256-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..936305bd254cb0c46f1bd0e8d0698f76b9a765c4
--- /dev/null
+++ b/configs/_base_/models/levit-256-p16.py
@@ -0,0 +1,26 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='LeViT',
+ arch='256',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0,
+ attn_ratio=2,
+ mlp_ratio=2,
+ out_indices=(2, )),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LeViTClsHead',
+ num_classes=1000,
+ in_channels=512,
+ distillation=True,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]))
diff --git a/configs/_base_/models/mae_hivit-base-p16.py b/configs/_base_/models/mae_hivit-base-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..bac073c840120c67e3c97b43bd5b308c62dbbbd9
--- /dev/null
+++ b/configs/_base_/models/mae_hivit-base-p16.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='MAE',
+ backbone=dict(
+ type='MAEHiViT', patch_size=16, arch='base', mask_ratio=0.75),
+ neck=dict(
+ type='MAEPretrainDecoder',
+ patch_size=16,
+ in_chans=3,
+ embed_dim=512,
+ decoder_embed_dim=512,
+ decoder_depth=6,
+ decoder_num_heads=16,
+ mlp_ratio=4.,
+ ),
+ head=dict(
+ type='MAEPretrainHead',
+ norm_pix=True,
+ patch_size=16,
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')),
+ init_cfg=[
+ dict(type='Xavier', layer='Linear', distribution='uniform'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ])
diff --git a/configs/_base_/models/mae_vit-base-p16.py b/configs/_base_/models/mae_vit-base-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..8cde8cb7c775d82941324f1abfa3432727b08a07
--- /dev/null
+++ b/configs/_base_/models/mae_vit-base-p16.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='MAE',
+ backbone=dict(type='MAEViT', arch='b', patch_size=16, mask_ratio=0.75),
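+    # mask_ratio=0.75: 75% of the patches are masked during pre-training.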
+ neck=dict(
+ type='MAEPretrainDecoder',
+ patch_size=16,
+ in_chans=3,
+ embed_dim=768,
+ decoder_embed_dim=512,
+ decoder_depth=8,
+ decoder_num_heads=16,
+ mlp_ratio=4.,
+ ),
+ head=dict(
+ type='MAEPretrainHead',
+ norm_pix=True,
+ patch_size=16,
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')),
+ init_cfg=[
+ dict(type='Xavier', layer='Linear', distribution='uniform'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ])
diff --git a/configs/_base_/models/mixmim/mixmim_base.py b/configs/_base_/models/mixmim/mixmim_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..ccde357570d22d3e1147b14ec480fd6b31f6a4cf
--- /dev/null
+++ b/configs/_base_/models/mixmim/mixmim_base.py
@@ -0,0 +1,20 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MixMIMTransformer', arch='B', drop_rate=0.0, drop_path_rate=0.1),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ init_cfg=None,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/mlp_mixer_base_patch16.py b/configs/_base_/models/mlp_mixer_base_patch16.py
new file mode 100644
index 0000000000000000000000000000000000000000..5ebd17f337bb3d6f14e0a45b40ef6f3342477090
--- /dev/null
+++ b/configs/_base_/models/mlp_mixer_base_patch16.py
@@ -0,0 +1,25 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MlpMixer',
+ arch='b',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=dict(type='GlobalAveragePooling', dim=1),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+)
diff --git a/configs/_base_/models/mlp_mixer_large_patch16.py b/configs/_base_/models/mlp_mixer_large_patch16.py
new file mode 100644
index 0000000000000000000000000000000000000000..ff107139bc9aa202b5b60696761f4167c25b5be3
--- /dev/null
+++ b/configs/_base_/models/mlp_mixer_large_patch16.py
@@ -0,0 +1,25 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MlpMixer',
+ arch='l',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=dict(type='GlobalAveragePooling', dim=1),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+)
diff --git a/configs/_base_/models/mobilenet_v2_1x.py b/configs/_base_/models/mobilenet_v2_1x.py
new file mode 100644
index 0000000000000000000000000000000000000000..6ebff1eff937a1390f23567c37debd164aeb8c9e
--- /dev/null
+++ b/configs/_base_/models/mobilenet_v2_1x.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileNetV2', widen_factor=1.0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_large_imagenet.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_large_imagenet.py
new file mode 100644
index 0000000000000000000000000000000000000000..5318f50feeb7d0d3f54bd70e6f854d1a74fb0743
--- /dev/null
+++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_large_imagenet.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileNetV3', arch='large'),
+ neck=dict(type='GlobalAveragePooling'),
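+    # Head: 960 -> 1280 hidden FC (HSwish, dropout 0.2) -> 1000 classes.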
+ head=dict(
+ type='StackedLinearClsHead',
+ num_classes=1000,
+ in_channels=960,
+ mid_channels=[1280],
+ dropout_rate=0.2,
+ act_cfg=dict(type='HSwish'),
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=dict(
+ type='Normal', layer='Linear', mean=0., std=0.01, bias=0.),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_050_imagenet.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_050_imagenet.py
new file mode 100644
index 0000000000000000000000000000000000000000..6356efcd1bf4beacb200f9bb4a3780963c68a302
--- /dev/null
+++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_050_imagenet.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileNetV3', arch='small_050'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='StackedLinearClsHead',
+ num_classes=1000,
+ in_channels=288,
+ mid_channels=[1024],
+ dropout_rate=0.2,
+ act_cfg=dict(type='HSwish'),
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=dict(
+ type='Normal', layer='Linear', mean=0., std=0.01, bias=0.),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_075_imagenet.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_075_imagenet.py
new file mode 100644
index 0000000000000000000000000000000000000000..19391ec26a2b1d86d0707a780e60033db166149c
--- /dev/null
+++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_075_imagenet.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileNetV3', arch='small_075'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='StackedLinearClsHead',
+ num_classes=1000,
+ in_channels=432,
+ mid_channels=[1024],
+ dropout_rate=0.2,
+ act_cfg=dict(type='HSwish'),
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=dict(
+ type='Normal', layer='Linear', mean=0., std=0.01, bias=0.),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_cifar.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_cifar.py
new file mode 100644
index 0000000000000000000000000000000000000000..5dbe980c47c83733b94a7cfe5b5ae44b3dd15729
--- /dev/null
+++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_cifar.py
@@ -0,0 +1,13 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileNetV3', arch='small'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='StackedLinearClsHead',
+ num_classes=10,
+ in_channels=576,
+ mid_channels=[1280],
+ act_cfg=dict(type='HSwish'),
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_imagenet.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_imagenet.py
new file mode 100644
index 0000000000000000000000000000000000000000..af6cc1b8d9dcb5b0ec21b38317950149a8a61a10
--- /dev/null
+++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_imagenet.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileNetV3', arch='small'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='StackedLinearClsHead',
+ num_classes=1000,
+ in_channels=576,
+ mid_channels=[1024],
+ dropout_rate=0.2,
+ act_cfg=dict(type='HSwish'),
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=dict(
+ type='Normal', layer='Linear', mean=0., std=0.01, bias=0.),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/mobileone/mobileone_s0.py b/configs/_base_/models/mobileone/mobileone_s0.py
new file mode 100644
index 0000000000000000000000000000000000000000..39624e5594e5270376a3e08719831f5e84ff234a
--- /dev/null
+++ b/configs/_base_/models/mobileone/mobileone_s0.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MobileOne',
+ arch='s0',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobileone/mobileone_s1.py b/configs/_base_/models/mobileone/mobileone_s1.py
new file mode 100644
index 0000000000000000000000000000000000000000..cea7762e4b93d6fde21901dbcdb9593209439a5f
--- /dev/null
+++ b/configs/_base_/models/mobileone/mobileone_s1.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MobileOne',
+ arch='s1',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobileone/mobileone_s2.py b/configs/_base_/models/mobileone/mobileone_s2.py
new file mode 100644
index 0000000000000000000000000000000000000000..dfae0e1f1a896830d0fde43fdada9f84c3fd3e30
--- /dev/null
+++ b/configs/_base_/models/mobileone/mobileone_s2.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MobileOne',
+ arch='s2',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobileone/mobileone_s3.py b/configs/_base_/models/mobileone/mobileone_s3.py
new file mode 100644
index 0000000000000000000000000000000000000000..813567530413cc4b73a3aef08a8b58dc9fca47e1
--- /dev/null
+++ b/configs/_base_/models/mobileone/mobileone_s3.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MobileOne',
+ arch='s3',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobileone/mobileone_s4.py b/configs/_base_/models/mobileone/mobileone_s4.py
new file mode 100644
index 0000000000000000000000000000000000000000..282eec8bcf1ce3adf2bfc3861734f1a5b65ea7bf
--- /dev/null
+++ b/configs/_base_/models/mobileone/mobileone_s4.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MobileOne',
+ arch='s4',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobilevit/mobilevit_s.py b/configs/_base_/models/mobilevit/mobilevit_s.py
new file mode 100644
index 0000000000000000000000000000000000000000..f6a4e05d2c8f1fc4f7b6a6b5953ff52cdfc7a2c6
--- /dev/null
+++ b/configs/_base_/models/mobilevit/mobilevit_s.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileViT', arch='small'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=640,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobilevit/mobilevit_xs.py b/configs/_base_/models/mobilevit/mobilevit_xs.py
new file mode 100644
index 0000000000000000000000000000000000000000..f8c6ef08eb0876bd70508fe72fd81e45470ffbf8
--- /dev/null
+++ b/configs/_base_/models/mobilevit/mobilevit_xs.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileViT', arch='x_small'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobilevit/mobilevit_xxs.py b/configs/_base_/models/mobilevit/mobilevit_xxs.py
new file mode 100644
index 0000000000000000000000000000000000000000..e1c26e6f3e9f559b2599589b7de690ef45ea5611
--- /dev/null
+++ b/configs/_base_/models/mobilevit/mobilevit_xxs.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileViT', arch='xx_small'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=320,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mvit/mvitv2-base.py b/configs/_base_/models/mvit/mvitv2-base.py
new file mode 100644
index 0000000000000000000000000000000000000000..0cb6064f627bb9ec8e80295623be6c734d1c03c9
--- /dev/null
+++ b/configs/_base_/models/mvit/mvitv2-base.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MViT', arch='base', drop_path_rate=0.3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=1000,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/mvit/mvitv2-large.py b/configs/_base_/models/mvit/mvitv2-large.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c84424311334030010f4b0651876ee8c3bc57cc
--- /dev/null
+++ b/configs/_base_/models/mvit/mvitv2-large.py
@@ -0,0 +1,23 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MViT',
+ arch='large',
+ drop_path_rate=0.5,
+ dim_mul_in_attention=False),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ in_channels=1152,
+ num_classes=1000,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/mvit/mvitv2-small.py b/configs/_base_/models/mvit/mvitv2-small.py
new file mode 100644
index 0000000000000000000000000000000000000000..df895f2950cbf7aa009c308a86352147e427e309
--- /dev/null
+++ b/configs/_base_/models/mvit/mvitv2-small.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MViT', arch='small', drop_path_rate=0.1),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=1000,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/mvit/mvitv2-tiny.py b/configs/_base_/models/mvit/mvitv2-tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..836f04bfce975487ccb05d38f47150e128313918
--- /dev/null
+++ b/configs/_base_/models/mvit/mvitv2-tiny.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MViT', arch='tiny', drop_path_rate=0.1),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=1000,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/poolformer/poolformer_m36.py b/configs/_base_/models/poolformer/poolformer_m36.py
new file mode 100644
index 0000000000000000000000000000000000000000..276a72122b18f0731aded4c7652897d92814d53d
--- /dev/null
+++ b/configs/_base_/models/poolformer/poolformer_m36.py
@@ -0,0 +1,22 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PoolFormer',
+ arch='m36',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/poolformer/poolformer_m48.py b/configs/_base_/models/poolformer/poolformer_m48.py
new file mode 100644
index 0000000000000000000000000000000000000000..8c006acbc0d01caa8ecc66b26a3d7b0e75725dab
--- /dev/null
+++ b/configs/_base_/models/poolformer/poolformer_m48.py
@@ -0,0 +1,22 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PoolFormer',
+ arch='m48',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/poolformer/poolformer_s12.py b/configs/_base_/models/poolformer/poolformer_s12.py
new file mode 100644
index 0000000000000000000000000000000000000000..b7b3600f35813acc633845050b1280873ac7ee47
--- /dev/null
+++ b/configs/_base_/models/poolformer/poolformer_s12.py
@@ -0,0 +1,22 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PoolFormer',
+ arch='s12',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/poolformer/poolformer_s24.py b/configs/_base_/models/poolformer/poolformer_s24.py
new file mode 100644
index 0000000000000000000000000000000000000000..822ab5b309c043569cfff4f124680906e9593a5b
--- /dev/null
+++ b/configs/_base_/models/poolformer/poolformer_s24.py
@@ -0,0 +1,22 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PoolFormer',
+ arch='s24',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/poolformer/poolformer_s36.py b/configs/_base_/models/poolformer/poolformer_s36.py
new file mode 100644
index 0000000000000000000000000000000000000000..489f2223c0dbfe25d02dc804843ff8ce379639d2
--- /dev/null
+++ b/configs/_base_/models/poolformer/poolformer_s36.py
@@ -0,0 +1,22 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PoolFormer',
+ arch='s36',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_1.6gf.py b/configs/_base_/models/regnet/regnetx_1.6gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..b81f0ad25bc5c6ccf1775e580f59b86a851fb950
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_1.6gf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_1.6gf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=912,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_12gf.py b/configs/_base_/models/regnet/regnetx_12gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..383d4f87992d3d7cb6b9de35e2a82e371a46b12c
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_12gf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_12gf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2240,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_3.2gf.py b/configs/_base_/models/regnet/regnetx_3.2gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..67d454139586d60c17f5468807f761f7835fd0f7
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_3.2gf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_3.2gf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1008,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_4.0gf.py b/configs/_base_/models/regnet/regnetx_4.0gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..01419c64bd18a5a1f9a0c9606209726b957f24ea
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_4.0gf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_4.0gf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1360,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_400mf.py b/configs/_base_/models/regnet/regnetx_400mf.py
new file mode 100644
index 0000000000000000000000000000000000000000..ef518b9f7df4484c158d24e9522a61e41cca3f15
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_400mf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_400mf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_6.4gf.py b/configs/_base_/models/regnet/regnetx_6.4gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..44e6222af015cd5a93e5feccdb98348f1da3991a
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_6.4gf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_6.4gf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1624,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_8.0gf.py b/configs/_base_/models/regnet/regnetx_8.0gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..29298268d767b45d3d5dcde4dd72663b1c407525
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_8.0gf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_8.0gf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1920,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_800mf.py b/configs/_base_/models/regnet/regnetx_800mf.py
new file mode 100644
index 0000000000000000000000000000000000000000..210f760fe29c104c662123af4cecef143ddc9ec3
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_800mf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_800mf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=672,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/replknet-31B_in1k.py b/configs/_base_/models/replknet-31B_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0cc50959d4bfc4597269de078ecabe5c663963b2
--- /dev/null
+++ b/configs/_base_/models/replknet-31B_in1k.py
@@ -0,0 +1,15 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RepLKNet',
+ arch='31B',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/replknet-31L_in1k.py b/configs/_base_/models/replknet-31L_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7830fb06f74a1ba2d7d437cc7733f446ecb12872
--- /dev/null
+++ b/configs/_base_/models/replknet-31L_in1k.py
@@ -0,0 +1,15 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RepLKNet',
+ arch='31L',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/replknet-XL_in1k.py b/configs/_base_/models/replknet-XL_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b63f3459c9914a247e8373e1fba4cbd8b4a5a81a
--- /dev/null
+++ b/configs/_base_/models/replknet-XL_in1k.py
@@ -0,0 +1,15 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RepLKNet',
+ arch='XL',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/repmlp-base_224.py b/configs/_base_/models/repmlp-base_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..7db0077882168d1466fede11243f70837df29395
--- /dev/null
+++ b/configs/_base_/models/repmlp-base_224.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RepMLPNet',
+ arch='B',
+ img_size=224,
+ out_indices=(3, ),
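+        # Extra 1x1/3x3 conv branches used during training; they are merged
+        # into the FC weights when the model is re-parameterized for deploy.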
+ reparam_conv_kernels=(1, 3),
+ deploy=False),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/repvgg-A0_in1k.py b/configs/_base_/models/repvgg-A0_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..093ffb7eea9f6af6a17e6fe766ba1f1a6160b28d
--- /dev/null
+++ b/configs/_base_/models/repvgg-A0_in1k.py
@@ -0,0 +1,15 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RepVGG',
+ arch='A0',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/repvgg-B3_lbs-mixup_in1k.py b/configs/_base_/models/repvgg-B3_lbs-mixup_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d88e687b35df35cd5993d24d929a686bf0af6f8b
--- /dev/null
+++ b/configs/_base_/models/repvgg-B3_lbs-mixup_in1k.py
@@ -0,0 +1,22 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RepVGG',
+ arch='B3',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2560,
+ loss=dict(
+ type='LabelSmoothLoss',
+ loss_weight=1.0,
+ label_smooth_val=0.1,
+ mode='classy_vision',
+ num_classes=1000),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
diff --git a/configs/_base_/models/res2net101-w26-s4.py b/configs/_base_/models/res2net101-w26-s4.py
new file mode 100644
index 0000000000000000000000000000000000000000..3bf64c508f95f8f3d2eb14afbe85799a49ee69aa
--- /dev/null
+++ b/configs/_base_/models/res2net101-w26-s4.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Res2Net',
+ depth=101,
+ scales=4,
+ base_width=26,
+ deep_stem=False,
+ avg_down=False,
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/res2net50-w14-s8.py b/configs/_base_/models/res2net50-w14-s8.py
new file mode 100644
index 0000000000000000000000000000000000000000..5875142c34d64f8414929bd43ccf37971bc97df8
--- /dev/null
+++ b/configs/_base_/models/res2net50-w14-s8.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Res2Net',
+ depth=50,
+ scales=8,
+ base_width=14,
+ deep_stem=False,
+ avg_down=False,
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/res2net50-w26-s4.py b/configs/_base_/models/res2net50-w26-s4.py
new file mode 100644
index 0000000000000000000000000000000000000000..be8fdb585903564a9572b575b48967dd1a12c3f4
--- /dev/null
+++ b/configs/_base_/models/res2net50-w26-s4.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Res2Net',
+ depth=50,
+ scales=4,
+ base_width=26,
+ deep_stem=False,
+ avg_down=False,
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/res2net50-w26-s6.py b/configs/_base_/models/res2net50-w26-s6.py
new file mode 100644
index 0000000000000000000000000000000000000000..281b136a67e245ee90e94bd1495b449af39118e3
--- /dev/null
+++ b/configs/_base_/models/res2net50-w26-s6.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Res2Net',
+ depth=50,
+ scales=6,
+ base_width=26,
+ deep_stem=False,
+ avg_down=False,
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/res2net50-w26-s8.py b/configs/_base_/models/res2net50-w26-s8.py
new file mode 100644
index 0000000000000000000000000000000000000000..b4f62f3ed19e4ba1f833a23cb5c8d434456b5b07
--- /dev/null
+++ b/configs/_base_/models/res2net50-w26-s8.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Res2Net',
+ depth=50,
+ scales=8,
+ base_width=26,
+ deep_stem=False,
+ avg_down=False,
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/res2net50-w48-s2.py b/configs/_base_/models/res2net50-w48-s2.py
new file mode 100644
index 0000000000000000000000000000000000000000..8675c91fa008f72ddcaa10f11b91e1f6feb79953
--- /dev/null
+++ b/configs/_base_/models/res2net50-w48-s2.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Res2Net',
+ depth=50,
+ scales=2,
+ base_width=48,
+ deep_stem=False,
+ avg_down=False,
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnest101.py b/configs/_base_/models/resnest101.py
new file mode 100644
index 0000000000000000000000000000000000000000..3780c1549359ec1850ce1db546d23a667e699d4f
--- /dev/null
+++ b/configs/_base_/models/resnest101.py
@@ -0,0 +1,25 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeSt',
+ depth=101,
+ num_stages=4,
+ stem_channels=128,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ num_classes=1000,
+ reduction='mean',
+ loss_weight=1.0),
+ topk=(1, 5),
+ cal_acc=False),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
diff --git a/configs/_base_/models/resnest200.py b/configs/_base_/models/resnest200.py
new file mode 100644
index 0000000000000000000000000000000000000000..40d8f03e7f528f8c0132bd2c19515460fd47fe70
--- /dev/null
+++ b/configs/_base_/models/resnest200.py
@@ -0,0 +1,25 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeSt',
+ depth=200,
+ num_stages=4,
+ stem_channels=128,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ num_classes=1000,
+ reduction='mean',
+ loss_weight=1.0),
+ topk=(1, 5),
+ cal_acc=False),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
diff --git a/configs/_base_/models/resnest269.py b/configs/_base_/models/resnest269.py
new file mode 100644
index 0000000000000000000000000000000000000000..c37626f5678630383693d784d2590f27caa11de2
--- /dev/null
+++ b/configs/_base_/models/resnest269.py
@@ -0,0 +1,25 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeSt',
+ depth=269,
+ num_stages=4,
+ stem_channels=128,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ num_classes=1000,
+ reduction='mean',
+ loss_weight=1.0),
+ topk=(1, 5),
+ cal_acc=False),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
diff --git a/configs/_base_/models/resnest50.py b/configs/_base_/models/resnest50.py
new file mode 100644
index 0000000000000000000000000000000000000000..51c90e86f468edccc3de3b0e7cd783548d220db4
--- /dev/null
+++ b/configs/_base_/models/resnest50.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeSt',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ num_classes=1000,
+ reduction='mean',
+ loss_weight=1.0),
+ topk=(1, 5),
+ cal_acc=False),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
diff --git a/configs/_base_/models/resnet101.py b/configs/_base_/models/resnet101.py
new file mode 100644
index 0000000000000000000000000000000000000000..1147cd4be9aff00ad6ce66c31e2839c1a94f9ca3
--- /dev/null
+++ b/configs/_base_/models/resnet101.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet101_cifar.py b/configs/_base_/models/resnet101_cifar.py
new file mode 100644
index 0000000000000000000000000000000000000000..a84d470e3a9828532e5cddcb1a3f7aa4fcae9f68
--- /dev/null
+++ b/configs/_base_/models/resnet101_cifar.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/resnet152.py b/configs/_base_/models/resnet152.py
new file mode 100644
index 0000000000000000000000000000000000000000..94a718c3cec213727a7a2f11baeb3594fd37532e
--- /dev/null
+++ b/configs/_base_/models/resnet152.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=152,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet152_cifar.py b/configs/_base_/models/resnet152_cifar.py
new file mode 100644
index 0000000000000000000000000000000000000000..55c0cc6c66dbde26bebe6d99d791c3e3f28e4e27
--- /dev/null
+++ b/configs/_base_/models/resnet152_cifar.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=152,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/resnet18.py b/configs/_base_/models/resnet18.py
new file mode 100644
index 0000000000000000000000000000000000000000..7c66758ee4aadced38c815e98af68b74aa310a2e
--- /dev/null
+++ b/configs/_base_/models/resnet18.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=18,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet18_cifar.py b/configs/_base_/models/resnet18_cifar.py
new file mode 100644
index 0000000000000000000000000000000000000000..7b9cf1e7337de73aa21515547b6c3d16e2b178ea
--- /dev/null
+++ b/configs/_base_/models/resnet18_cifar.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=18,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/resnet34.py b/configs/_base_/models/resnet34.py
new file mode 100644
index 0000000000000000000000000000000000000000..100ee286bead6b5dd88f1752660e8ab9d0498e37
--- /dev/null
+++ b/configs/_base_/models/resnet34.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=34,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet34_cifar.py b/configs/_base_/models/resnet34_cifar.py
new file mode 100644
index 0000000000000000000000000000000000000000..55d033bc30bcbde7aef8e57ad950f59c248ad74b
--- /dev/null
+++ b/configs/_base_/models/resnet34_cifar.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=34,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/resnet34_gem.py b/configs/_base_/models/resnet34_gem.py
new file mode 100644
index 0000000000000000000000000000000000000000..5c0e0d3e8dc5d7a0b259f1624ee2402af8a401cd
--- /dev/null
+++ b/configs/_base_/models/resnet34_gem.py
@@ -0,0 +1,18 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=34,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
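+    # GeM: generalized-mean pooling with a learnable exponent.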
+ neck=dict(type='GeneralizedMeanPooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet50.py b/configs/_base_/models/resnet50.py
new file mode 100644
index 0000000000000000000000000000000000000000..129a2bb50c91f3034997d216f3a9efb743d9cc40
--- /dev/null
+++ b/configs/_base_/models/resnet50.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet50_cifar.py b/configs/_base_/models/resnet50_cifar.py
new file mode 100644
index 0000000000000000000000000000000000000000..33b66d526482245237faa2862d376797c21a8ee4
--- /dev/null
+++ b/configs/_base_/models/resnet50_cifar.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/resnet50_cifar_cutmix.py b/configs/_base_/models/resnet50_cifar_cutmix.py
new file mode 100644
index 0000000000000000000000000000000000000000..73c38be271a90b1655ae63e4f36cf6c3a3c5fdc4
--- /dev/null
+++ b/configs/_base_/models/resnet50_cifar_cutmix.py
@@ -0,0 +1,18 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='MultiLabelLinearClsHead',
+ num_classes=10,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0, use_soft=True)),
+ train_cfg=dict(
+ augments=dict(type='BatchCutMix', alpha=1.0, num_classes=10,
+ prob=1.0)))
diff --git a/configs/_base_/models/resnet50_cifar_mixup.py b/configs/_base_/models/resnet50_cifar_mixup.py
new file mode 100644
index 0000000000000000000000000000000000000000..f165c2466bd8a67cbfadd5f3a388d4fe03e6d446
--- /dev/null
+++ b/configs/_base_/models/resnet50_cifar_mixup.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='MultiLabelLinearClsHead',
+ num_classes=10,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0, use_soft=True)),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=1.)),
+)
diff --git a/configs/_base_/models/resnet50_cutmix.py b/configs/_base_/models/resnet50_cutmix.py
new file mode 100644
index 0000000000000000000000000000000000000000..fb79088b798d1c16eb6c336006143c2fe288e6a2
--- /dev/null
+++ b/configs/_base_/models/resnet50_cutmix.py
@@ -0,0 +1,18 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='MultiLabelLinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0, use_soft=True)),
+ train_cfg=dict(
+ augments=dict(
+ type='BatchCutMix', alpha=1.0, num_classes=1000, prob=1.0)))
diff --git a/configs/_base_/models/resnet50_label_smooth.py b/configs/_base_/models/resnet50_label_smooth.py
new file mode 100644
index 0000000000000000000000000000000000000000..b6f793751904658b3e7e01a5ffdaa6b86e156e66
--- /dev/null
+++ b/configs/_base_/models/resnet50_label_smooth.py
@@ -0,0 +1,18 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet50_mixup.py b/configs/_base_/models/resnet50_mixup.py
new file mode 100644
index 0000000000000000000000000000000000000000..23130a69c98823a6979dcd7ee7441746753a9865
--- /dev/null
+++ b/configs/_base_/models/resnet50_mixup.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='MultiLabelLinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0, use_soft=True)),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
diff --git a/configs/_base_/models/resnetv1c50.py b/configs/_base_/models/resnetv1c50.py
new file mode 100644
index 0000000000000000000000000000000000000000..3b973e20181cd3cf1c470db84abf97aeaa0549c1
--- /dev/null
+++ b/configs/_base_/models/resnetv1c50.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNetV1c',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnetv1d101.py b/configs/_base_/models/resnetv1d101.py
new file mode 100644
index 0000000000000000000000000000000000000000..1e56223121fb22ac089800ebeb69310758d0f2e7
--- /dev/null
+++ b/configs/_base_/models/resnetv1d101.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNetV1d',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnetv1d152.py b/configs/_base_/models/resnetv1d152.py
new file mode 100644
index 0000000000000000000000000000000000000000..58cc73beb318e38f9ce79154a1265be1a7dba17b
--- /dev/null
+++ b/configs/_base_/models/resnetv1d152.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNetV1d',
+ depth=152,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnetv1d50.py b/configs/_base_/models/resnetv1d50.py
new file mode 100644
index 0000000000000000000000000000000000000000..015aaa3d8182cae50f392d7103e24e8ac8a188aa
--- /dev/null
+++ b/configs/_base_/models/resnetv1d50.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNetV1d',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnext101_32x4d.py b/configs/_base_/models/resnext101_32x4d.py
new file mode 100644
index 0000000000000000000000000000000000000000..1c89fb6488701c83f12e623ae606abbe3b78799f
--- /dev/null
+++ b/configs/_base_/models/resnext101_32x4d.py
@@ -0,0 +1,19 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeXt',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ groups=32,
+ width_per_group=4,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnext101_32x8d.py b/configs/_base_/models/resnext101_32x8d.py
new file mode 100644
index 0000000000000000000000000000000000000000..2bb63f3aeb8b37eb701135ed1c6bf2d15869fae3
--- /dev/null
+++ b/configs/_base_/models/resnext101_32x8d.py
@@ -0,0 +1,19 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeXt',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ groups=32,
+ width_per_group=8,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnext152_32x4d.py b/configs/_base_/models/resnext152_32x4d.py
new file mode 100644
index 0000000000000000000000000000000000000000..d392eff3dc673b0b74ed013c030152a0107799a2
--- /dev/null
+++ b/configs/_base_/models/resnext152_32x4d.py
@@ -0,0 +1,19 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeXt',
+ depth=152,
+ num_stages=4,
+ out_indices=(3, ),
+ groups=32,
+ width_per_group=4,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnext50_32x4d.py b/configs/_base_/models/resnext50_32x4d.py
new file mode 100644
index 0000000000000000000000000000000000000000..060426231e8cd845fda17ea053478cf7f57b940a
--- /dev/null
+++ b/configs/_base_/models/resnext50_32x4d.py
@@ -0,0 +1,19 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeXt',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ groups=32,
+ width_per_group=4,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/revvit/revvit-base.py b/configs/_base_/models/revvit/revvit-base.py
new file mode 100644
index 0000000000000000000000000000000000000000..85b7af42ea7fd6856fd81bc99ee829fb40bce435
--- /dev/null
+++ b/configs/_base_/models/revvit/revvit-base.py
@@ -0,0 +1,27 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RevVisionTransformer',
+ arch='deit-base',
+ img_size=224,
+ patch_size=16,
+ out_type='avg_featmap',
+ ),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/revvit/revvit-small.py b/configs/_base_/models/revvit/revvit-small.py
new file mode 100644
index 0000000000000000000000000000000000000000..dd1a0b2661ac2cf54554c06bd729477b94dad908
--- /dev/null
+++ b/configs/_base_/models/revvit/revvit-small.py
@@ -0,0 +1,27 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RevVisionTransformer',
+ arch='deit-small',
+ img_size=224,
+ patch_size=16,
+ out_type='avg_featmap',
+ ),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/seresnet101.py b/configs/_base_/models/seresnet101.py
new file mode 100644
index 0000000000000000000000000000000000000000..137a6f90f6bca160a073877fc43ea6398fa1d0b4
--- /dev/null
+++ b/configs/_base_/models/seresnet101.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SEResNet',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/seresnet50.py b/configs/_base_/models/seresnet50.py
new file mode 100644
index 0000000000000000000000000000000000000000..e5f6bfce8db9ed75936229bf57992a0211a95b7d
--- /dev/null
+++ b/configs/_base_/models/seresnet50.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SEResNet',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/seresnext101_32x4d.py b/configs/_base_/models/seresnext101_32x4d.py
new file mode 100644
index 0000000000000000000000000000000000000000..cc8a62c39305993bf9b717edf980a1546de12a2b
--- /dev/null
+++ b/configs/_base_/models/seresnext101_32x4d.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SEResNeXt',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ groups=32,
+ width_per_group=4,
+ se_ratio=16,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/seresnext50_32x4d.py b/configs/_base_/models/seresnext50_32x4d.py
new file mode 100644
index 0000000000000000000000000000000000000000..0cdf7cb696be22d3a5fa5829162052c8b9b7e7a8
--- /dev/null
+++ b/configs/_base_/models/seresnext50_32x4d.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SEResNeXt',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ groups=32,
+ width_per_group=4,
+ se_ratio=16,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/shufflenet_v1_1x.py b/configs/_base_/models/shufflenet_v1_1x.py
new file mode 100644
index 0000000000000000000000000000000000000000..f0f9d1fbdde759e6c13d9a02705072b3f11faf02
--- /dev/null
+++ b/configs/_base_/models/shufflenet_v1_1x.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ShuffleNetV1', groups=3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=960,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/shufflenet_v2_1x.py b/configs/_base_/models/shufflenet_v2_1x.py
new file mode 100644
index 0000000000000000000000000000000000000000..190800e343d75a89ffb67a1f7dd33db04d26429d
--- /dev/null
+++ b/configs/_base_/models/shufflenet_v2_1x.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ShuffleNetV2', widen_factor=1.0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/swin_transformer/base_224.py b/configs/_base_/models/swin_transformer/base_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..b7c277f2d6494a6d069bcf053349d8c5df2a0bc3
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/base_224.py
@@ -0,0 +1,23 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer', arch='base', img_size=224, drop_path_rate=0.5),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/swin_transformer/base_384.py b/configs/_base_/models/swin_transformer/base_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..ce78981fb0775bdb4048522f32e25c58e2159160
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/base_384.py
@@ -0,0 +1,16 @@
+# model settings
+# Only for evaluation
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer',
+ arch='base',
+ img_size=384,
+ stage_cfgs=dict(block_cfgs=dict(window_size=12))),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/swin_transformer/large_224.py b/configs/_base_/models/swin_transformer/large_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..747d00e44d4b81383998d7f18b7ae8668bf41c5f
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/large_224.py
@@ -0,0 +1,12 @@
+# model settings
+# Only for evaluation
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='SwinTransformer', arch='large', img_size=224),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/swin_transformer/large_384.py b/configs/_base_/models/swin_transformer/large_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..7026f81a31de2adc445b8ce45520904205f72cee
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/large_384.py
@@ -0,0 +1,16 @@
+# model settings
+# Only for evaluation
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer',
+ arch='large',
+ img_size=384,
+ stage_cfgs=dict(block_cfgs=dict(window_size=12))),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/swin_transformer/small_224.py b/configs/_base_/models/swin_transformer/small_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..d87d9d9af6ce9c80581dc03925ed13b4b36893fc
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/small_224.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer', arch='small', img_size=224,
+ drop_path_rate=0.3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/swin_transformer/tiny_224.py b/configs/_base_/models/swin_transformer/tiny_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1781cf5f84fe9dd8386b29337a9fe4f6d717784
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/tiny_224.py
@@ -0,0 +1,23 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer', arch='tiny', img_size=224, drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/swin_transformer/tiny_base_224.py b/configs/_base_/models/swin_transformer/tiny_base_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..e353b8cf0c3e66afee351e269475dfd3b234dd2a
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/tiny_base_224.py
@@ -0,0 +1,23 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer', arch='base', img_size=224, drop_path_rate=0.5),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=200,
+ in_channels=1024,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/swin_transformer/tiny_large_224.py b/configs/_base_/models/swin_transformer/tiny_large_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9e3f9118a68485691f2445ea9dc46917a3ad2cf
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/tiny_large_224.py
@@ -0,0 +1,12 @@
+# model settings
+# Only for evaluation
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='SwinTransformer', arch='large', img_size=224),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=200,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/swin_transformer_v2/base_256.py b/configs/_base_/models/swin_transformer_v2/base_256.py
new file mode 100644
index 0000000000000000000000000000000000000000..66594db25b17a20a346fcff944f2d37d8ff860f7
--- /dev/null
+++ b/configs/_base_/models/swin_transformer_v2/base_256.py
@@ -0,0 +1,26 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformerV2',
+ arch='base',
+ img_size=256,
+ drop_path_rate=0.5),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/swin_transformer_v2/base_384.py b/configs/_base_/models/swin_transformer_v2/base_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..5fb9aead2e98bba3f9277a02024981a1e22b6046
--- /dev/null
+++ b/configs/_base_/models/swin_transformer_v2/base_384.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformerV2',
+ arch='base',
+ img_size=384,
+ drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False))
diff --git a/configs/_base_/models/swin_transformer_v2/large_256.py b/configs/_base_/models/swin_transformer_v2/large_256.py
new file mode 100644
index 0000000000000000000000000000000000000000..fe557c32058be1563ed50696b9f44b95b3bb3bed
--- /dev/null
+++ b/configs/_base_/models/swin_transformer_v2/large_256.py
@@ -0,0 +1,16 @@
+# model settings
+# Only for evaluation
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformerV2',
+ arch='large',
+ img_size=256,
+ drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/swin_transformer_v2/large_384.py b/configs/_base_/models/swin_transformer_v2/large_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..a626c40715d1ea2cb1fb0cda0a249d1df01544dc
--- /dev/null
+++ b/configs/_base_/models/swin_transformer_v2/large_384.py
@@ -0,0 +1,16 @@
+# model settings
+# Only for evaluation
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformerV2',
+ arch='large',
+ img_size=384,
+ drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/swin_transformer_v2/small_256.py b/configs/_base_/models/swin_transformer_v2/small_256.py
new file mode 100644
index 0000000000000000000000000000000000000000..0ec706ff0e16e44027fad3ee54e93280018d76bd
--- /dev/null
+++ b/configs/_base_/models/swin_transformer_v2/small_256.py
@@ -0,0 +1,26 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformerV2',
+ arch='small',
+ img_size=256,
+ drop_path_rate=0.3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/swin_transformer_v2/tiny_256.py b/configs/_base_/models/swin_transformer_v2/tiny_256.py
new file mode 100644
index 0000000000000000000000000000000000000000..61055a1310ab86bea26d427fe445bc4cfe7bf89e
--- /dev/null
+++ b/configs/_base_/models/swin_transformer_v2/tiny_256.py
@@ -0,0 +1,26 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformerV2',
+ arch='tiny',
+ img_size=256,
+ drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/t2t-vit-t-14.py b/configs/_base_/models/t2t-vit-t-14.py
new file mode 100644
index 0000000000000000000000000000000000000000..58ea660e742b1ef8edf93fb10ac1331734a4dbe5
--- /dev/null
+++ b/configs/_base_/models/t2t-vit-t-14.py
@@ -0,0 +1,42 @@
+# model settings
+embed_dims = 384
+num_classes = 1000
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='T2T_ViT',
+ img_size=224,
+ in_channels=3,
+ embed_dims=embed_dims,
+ t2t_cfg=dict(
+ token_dims=64,
+ use_performer=False,
+ ),
+ num_layers=14,
+ layer_cfgs=dict(
+ num_heads=6,
+ feedforward_channels=3 * embed_dims, # mlp_ratio = 3
+ ),
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=num_classes,
+ in_channels=embed_dims,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=.02)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/t2t-vit-t-19.py b/configs/_base_/models/t2t-vit-t-19.py
new file mode 100644
index 0000000000000000000000000000000000000000..51741c7a7cbcfd8f13fb1574f831978a144ca1a4
--- /dev/null
+++ b/configs/_base_/models/t2t-vit-t-19.py
@@ -0,0 +1,42 @@
+# model settings
+embed_dims = 448
+num_classes = 1000
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='T2T_ViT',
+ img_size=224,
+ in_channels=3,
+ embed_dims=embed_dims,
+ t2t_cfg=dict(
+ token_dims=64,
+ use_performer=False,
+ ),
+ num_layers=19,
+ layer_cfgs=dict(
+ num_heads=7,
+ feedforward_channels=3 * embed_dims, # mlp_ratio = 3
+ ),
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=num_classes,
+ in_channels=embed_dims,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=.02)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/t2t-vit-t-24.py b/configs/_base_/models/t2t-vit-t-24.py
new file mode 100644
index 0000000000000000000000000000000000000000..ad772cf6e614bbca630ffad75393614415102bb9
--- /dev/null
+++ b/configs/_base_/models/t2t-vit-t-24.py
@@ -0,0 +1,42 @@
+# model settings
+embed_dims = 512
+num_classes = 1000
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='T2T_ViT',
+ img_size=224,
+ in_channels=3,
+ embed_dims=embed_dims,
+ t2t_cfg=dict(
+ token_dims=64,
+ use_performer=False,
+ ),
+ num_layers=24,
+ layer_cfgs=dict(
+ num_heads=8,
+ feedforward_channels=3 * embed_dims, # mlp_ratio = 3
+ ),
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=num_classes,
+ in_channels=embed_dims,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=.02)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/tiny-vit-large-p16.py b/configs/_base_/models/tiny-vit-large-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..8e4e7f656bc73f5b4e66610fd134950afa377ea8
--- /dev/null
+++ b/configs/_base_/models/tiny-vit-large-p16.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='l',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=200,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/tinyvit/tinyvit-11m.py b/configs/_base_/models/tinyvit/tinyvit-11m.py
new file mode 100644
index 0000000000000000000000000000000000000000..6c046e35a0fe11aaa679300d3a2d3be59ff1051b
--- /dev/null
+++ b/configs/_base_/models/tinyvit/tinyvit-11m.py
@@ -0,0 +1,25 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='TinyViT',
+ arch='11m',
+ img_size=(224, 224),
+ window_size=[7, 7, 14, 7],
+ out_indices=(3, ),
+ drop_path_rate=0.1,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=448,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/tinyvit/tinyvit-21m.py b/configs/_base_/models/tinyvit/tinyvit-21m.py
new file mode 100644
index 0000000000000000000000000000000000000000..7f362f8f62789f6442e33a5a000ce8d9a458a597
--- /dev/null
+++ b/configs/_base_/models/tinyvit/tinyvit-21m.py
@@ -0,0 +1,25 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='TinyViT',
+ arch='21m',
+ img_size=(224, 224),
+ window_size=[7, 7, 14, 7],
+ out_indices=(3, ),
+ drop_path_rate=0.2,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=576,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/tinyvit/tinyvit-5m.py b/configs/_base_/models/tinyvit/tinyvit-5m.py
new file mode 100644
index 0000000000000000000000000000000000000000..923ebd918f82f40537e0f40f550c3cd264d7e389
--- /dev/null
+++ b/configs/_base_/models/tinyvit/tinyvit-5m.py
@@ -0,0 +1,25 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='TinyViT',
+ arch='5m',
+ img_size=(224, 224),
+ window_size=[7, 7, 14, 7],
+ out_indices=(3, ),
+ drop_path_rate=0.0,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=320,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/tnt_s_patch16_224.py b/configs/_base_/models/tnt_s_patch16_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..5e13d07828c5d89d0e9ce4fc1a29fe7a6a4875d4
--- /dev/null
+++ b/configs/_base_/models/tnt_s_patch16_224.py
@@ -0,0 +1,29 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='TNT',
+ arch='s',
+ img_size=224,
+ patch_size=16,
+ in_channels=3,
+ ffn_ratio=4,
+ qkv_bias=False,
+ drop_rate=0.,
+ attn_drop_rate=0.,
+ drop_path_rate=0.1,
+ first_stride=4,
+ num_fcs=2,
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ]),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ topk=(1, 5),
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=.02)))
diff --git a/configs/_base_/models/twins_pcpvt_base.py b/configs/_base_/models/twins_pcpvt_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..14e46baedd273bd3baef163e2966653626170a1c
--- /dev/null
+++ b/configs/_base_/models/twins_pcpvt_base.py
@@ -0,0 +1,31 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PCPVT',
+ arch='base',
+ in_channels=3,
+ out_indices=(3, ),
+ qkv_bias=True,
+ norm_cfg=dict(type='LN', eps=1e-06),
+ norm_after_stage=[False, False, False, True],
+ drop_rate=0.0,
+ attn_drop_rate=0.,
+ drop_path_rate=0.3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/twins_svt_base.py b/configs/_base_/models/twins_svt_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..a37385b018f9b345ebcd3a9aaad575cd98e8b8f3
--- /dev/null
+++ b/configs/_base_/models/twins_svt_base.py
@@ -0,0 +1,31 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SVT',
+ arch='base',
+ in_channels=3,
+ out_indices=(3, ),
+ qkv_bias=True,
+ norm_cfg=dict(type='LN'),
+ norm_after_stage=[False, False, False, True],
+ drop_rate=0.0,
+ attn_drop_rate=0.,
+ drop_path_rate=0.3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/van/van_base.py b/configs/_base_/models/van/van_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..006459255f82f4ad4250ee01f1d9d25605beb5d1
--- /dev/null
+++ b/configs/_base_/models/van/van_base.py
@@ -0,0 +1,13 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VAN', arch='base', drop_path_rate=0.1),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False))
diff --git a/configs/_base_/models/van/van_large.py b/configs/_base_/models/van/van_large.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ebafabdaaf7a4b828919e61e980e423385897e6
--- /dev/null
+++ b/configs/_base_/models/van/van_large.py
@@ -0,0 +1,13 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VAN', arch='large', drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False))
diff --git a/configs/_base_/models/van/van_small.py b/configs/_base_/models/van/van_small.py
new file mode 100644
index 0000000000000000000000000000000000000000..29393c6308af0732f4757d1ef4bd98d7b3cddcf1
--- /dev/null
+++ b/configs/_base_/models/van/van_small.py
@@ -0,0 +1,22 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VAN', arch='small', drop_path_rate=0.1),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/van/van_tiny.py b/configs/_base_/models/van/van_tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..9cf5b28836f9216c642dfdfb62f37f3066a7ad09
--- /dev/null
+++ b/configs/_base_/models/van/van_tiny.py
@@ -0,0 +1,22 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VAN', arch='tiny', drop_path_rate=0.1),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=256,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vgg11.py b/configs/_base_/models/vgg11.py
new file mode 100644
index 0000000000000000000000000000000000000000..2b6ee1426aae383b1db5c4451e37caec5eafdcfa
--- /dev/null
+++ b/configs/_base_/models/vgg11.py
@@ -0,0 +1,10 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VGG', depth=11, num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg11bn.py b/configs/_base_/models/vgg11bn.py
new file mode 100644
index 0000000000000000000000000000000000000000..cb4c64e95a85367841615fd52af7af50b5b1e9fb
--- /dev/null
+++ b/configs/_base_/models/vgg11bn.py
@@ -0,0 +1,11 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VGG', depth=11, norm_cfg=dict(type='BN'), num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg13.py b/configs/_base_/models/vgg13.py
new file mode 100644
index 0000000000000000000000000000000000000000..a9389100a61514043bbe7426b93cfd257df5cd26
--- /dev/null
+++ b/configs/_base_/models/vgg13.py
@@ -0,0 +1,10 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VGG', depth=13, num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg13bn.py b/configs/_base_/models/vgg13bn.py
new file mode 100644
index 0000000000000000000000000000000000000000..b12173b51b80b671fd85c9fa8ececd75881d4bd2
--- /dev/null
+++ b/configs/_base_/models/vgg13bn.py
@@ -0,0 +1,11 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VGG', depth=13, norm_cfg=dict(type='BN'), num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg16.py b/configs/_base_/models/vgg16.py
new file mode 100644
index 0000000000000000000000000000000000000000..93ce864fac29a7c4adf4df12e5653f97ce09d7be
--- /dev/null
+++ b/configs/_base_/models/vgg16.py
@@ -0,0 +1,10 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VGG', depth=16, num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg16bn.py b/configs/_base_/models/vgg16bn.py
new file mode 100644
index 0000000000000000000000000000000000000000..765e34f6367bc52e10322692a849d1003d57dfd2
--- /dev/null
+++ b/configs/_base_/models/vgg16bn.py
@@ -0,0 +1,11 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VGG', depth=16, norm_cfg=dict(type='BN'), num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg19.py b/configs/_base_/models/vgg19.py
new file mode 100644
index 0000000000000000000000000000000000000000..6f4ab061b2c7a87d86aaebcf78aaf84abd2bb0cc
--- /dev/null
+++ b/configs/_base_/models/vgg19.py
@@ -0,0 +1,10 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VGG', depth=19, num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg19bn.py b/configs/_base_/models/vgg19bn.py
new file mode 100644
index 0000000000000000000000000000000000000000..c468b5dea2cc5503ca2b266c57d163b2308b7dd3
--- /dev/null
+++ b/configs/_base_/models/vgg19bn.py
@@ -0,0 +1,11 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VGG', depth=19, norm_cfg=dict(type='BN'), num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vig/pyramid_vig_base.py b/configs/_base_/models/vig/pyramid_vig_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..a258457c84aecc2f1cdf29131f60b522526dbdd8
--- /dev/null
+++ b/configs/_base_/models/vig/pyramid_vig_base.py
@@ -0,0 +1,32 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PyramidVig',
+ arch='base',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_stochastic=False,
+ drop_path=0.1,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vig/pyramid_vig_medium.py b/configs/_base_/models/vig/pyramid_vig_medium.py
new file mode 100644
index 0000000000000000000000000000000000000000..a551aba3e079576e13f5db3a77d5e6622079e497
--- /dev/null
+++ b/configs/_base_/models/vig/pyramid_vig_medium.py
@@ -0,0 +1,32 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PyramidVig',
+ arch='medium',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_stochastic=False,
+ drop_path=0.1,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=768,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vig/pyramid_vig_small.py b/configs/_base_/models/vig/pyramid_vig_small.py
new file mode 100644
index 0000000000000000000000000000000000000000..940275e6cf941ce0d6a7f7dc3e4a1b867cf88309
--- /dev/null
+++ b/configs/_base_/models/vig/pyramid_vig_small.py
@@ -0,0 +1,32 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PyramidVig',
+ arch='small',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_stochastic=False,
+ drop_path=0.1,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=640,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vig/pyramid_vig_tiny.py b/configs/_base_/models/vig/pyramid_vig_tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..fea0734fe9ab2e962e51b819c467ad965b88a958
--- /dev/null
+++ b/configs/_base_/models/vig/pyramid_vig_tiny.py
@@ -0,0 +1,32 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PyramidVig',
+ arch='tiny',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_stochastic=False,
+ drop_path=0.1,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=384,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vig/vig_base.py b/configs/_base_/models/vig/vig_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..6c5f293ddfab1e8712c90f96aaa62acf62159e65
--- /dev/null
+++ b/configs/_base_/models/vig/vig_base.py
@@ -0,0 +1,33 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Vig',
+ arch='base',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_dilation=True,
+ use_stochastic=False,
+ drop_path=0.1,
+ relative_pos=False,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=640,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vig/vig_small.py b/configs/_base_/models/vig/vig_small.py
new file mode 100644
index 0000000000000000000000000000000000000000..93587ffba628d8900b17a537eed1406c7af57e9a
--- /dev/null
+++ b/configs/_base_/models/vig/vig_small.py
@@ -0,0 +1,33 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Vig',
+ arch='small',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_dilation=True,
+ use_stochastic=False,
+ drop_path=0.1,
+ relative_pos=False,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=320,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vig/vig_tiny.py b/configs/_base_/models/vig/vig_tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..c50bac222a88a665a1b7adc8398f805ff10be7f1
--- /dev/null
+++ b/configs/_base_/models/vig/vig_tiny.py
@@ -0,0 +1,33 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Vig',
+ arch='tiny',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_dilation=True,
+ use_stochastic=False,
+ drop_path=0.1,
+ relative_pos=False,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=192,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vit-base-p16.py b/configs/_base_/models/vit-base-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..bb42bed5fa5ecedf9aa94c82ee63462181df0605
--- /dev/null
+++ b/configs/_base_/models/vit-base-p16.py
@@ -0,0 +1,25 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1,
+ mode='classy_vision'),
+ ))
diff --git a/configs/_base_/models/vit-base-p32.py b/configs/_base_/models/vit-base-p32.py
new file mode 100644
index 0000000000000000000000000000000000000000..ad550ef9b9bdbb218e6743ccf37e7929e5758865
--- /dev/null
+++ b/configs/_base_/models/vit-base-p32.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=224,
+ patch_size=32,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vit-large-p16.py b/configs/_base_/models/vit-large-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..97162304563827716366d20bd29a11fed542be62
--- /dev/null
+++ b/configs/_base_/models/vit-large-p16.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='l',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vit-large-p32.py b/configs/_base_/models/vit-large-p32.py
new file mode 100644
index 0000000000000000000000000000000000000000..f9491bb561433ff01f60a8aa7a4993c28c8b9b02
--- /dev/null
+++ b/configs/_base_/models/vit-large-p32.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='l',
+ img_size=224,
+ patch_size=32,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/wide-resnet50.py b/configs/_base_/models/wide-resnet50.py
new file mode 100644
index 0000000000000000000000000000000000000000..a2913b9aa6afb10c36199530441ab39348650bc7
--- /dev/null
+++ b/configs/_base_/models/wide-resnet50.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ stem_channels=64,
+ base_channels=128,
+ expansion=2,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/schedules/cifar10_bs128.py b/configs/_base_/schedules/cifar10_bs128.py
new file mode 100644
index 0000000000000000000000000000000000000000..fadb6c1285515b0d0ee7c2c17c3a9d19f4a63713
--- /dev/null
+++ b/configs/_base_/schedules/cifar10_bs128.py
@@ -0,0 +1,15 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+# learning policy
+param_scheduler = dict(
+ type='MultiStepLR', by_epoch=True, milestones=[100, 150], gamma=0.1)
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=200, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/_base_/schedules/cub_bs64.py b/configs/_base_/schedules/cub_bs64.py
new file mode 100644
index 0000000000000000000000000000000000000000..1d0b4be7bd7b7043636fb2356b76512281a37e2b
--- /dev/null
+++ b/configs/_base_/schedules/cub_bs64.py
@@ -0,0 +1,34 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(
+ type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005, nesterov=True))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.01,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ )
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=64)
diff --git a/configs/_base_/schedules/imagenet_bs1024_adamw_conformer.py b/configs/_base_/schedules/imagenet_bs1024_adamw_conformer.py
new file mode 100644
index 0000000000000000000000000000000000000000..2285d0ea6c70de222a76d6b7404fc16e5fd28e0e
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs1024_adamw_conformer.py
@@ -0,0 +1,43 @@
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+ # with a batch size of 128 per GPU on 8 GPUs (1024 in total):
+ # lr = 5e-4 * 128 * 8 / 512 = 0.001
+ lr=5e-4 * 128 * 8 / 512,
+ weight_decay=0.05,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ }),
+)
+
+# learning policy
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=5,
+ end=300)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/_base_/schedules/imagenet_bs1024_adamw_hivit.py b/configs/_base_/schedules/imagenet_bs1024_adamw_hivit.py
new file mode 100644
index 0000000000000000000000000000000000000000..5b2df97b813d1c3922dd470d2f0743eca44221ee
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs1024_adamw_hivit.py
@@ -0,0 +1,41 @@
+# with a batch size of 128 per GPU on 8 GPUs (1024 in total):
+# lr = 5e-4 * 1024 / 512 = 0.001
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+ lr=5e-4 * 1024 / 512,
+ weight_decay=0.05,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0,
+ custom_keys={
+ '.pos_embed': dict(decay_mult=0.0),
+ '.relative_position_bias_table': dict(decay_mult=0.0)
+ }),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/_base_/schedules/imagenet_bs1024_adamw_revvit.py b/configs/_base_/schedules/imagenet_bs1024_adamw_revvit.py
new file mode 100644
index 0000000000000000000000000000000000000000..87fd202ce4076a69cae63f0d9d3f6b860639ff49
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs1024_adamw_revvit.py
@@ -0,0 +1,41 @@
+# assuming a total batch size of 2048:
+# lr = 5e-4 * 2048 / 512 = 0.002
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+ lr=5e-4 * 2048 / 512,
+ weight_decay=0.05,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=1.0),
+)
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-8 / 2e-3,
+ by_epoch=True,
+ end=70,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/_base_/schedules/imagenet_bs1024_adamw_swin.py b/configs/_base_/schedules/imagenet_bs1024_adamw_swin.py
new file mode 100644
index 0000000000000000000000000000000000000000..fd06cc115a7ab4cbaa7ef7fa1d9366bdd5db878f
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs1024_adamw_swin.py
@@ -0,0 +1,41 @@
+# with a batch size of 128 per GPU on 8 GPUs (1024 in total):
+# lr = 5e-4 * 1024 / 512 = 0.001
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+ lr=5e-4 * 1024 / 512,
+ weight_decay=0.05,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0,
+ custom_keys={
+ '.absolute_pos_embed': dict(decay_mult=0.0),
+ '.relative_position_bias_table': dict(decay_mult=0.0)
+ }),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/_base_/schedules/imagenet_bs1024_coslr.py b/configs/_base_/schedules/imagenet_bs1024_coslr.py
new file mode 100644
index 0000000000000000000000000000000000000000..285884d0b2b132329bab682f4418d891d7378ec1
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs1024_coslr.py
@@ -0,0 +1,18 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=5e-5))
+
+# learning policy
+param_scheduler = [
+ dict(type='LinearLR', start_factor=0.1, by_epoch=True, begin=0, end=5),
+ dict(type='CosineAnnealingLR', T_max=95, by_epoch=True, begin=5, end=100)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py b/configs/_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py
new file mode 100644
index 0000000000000000000000000000000000000000..cf38d4731c867ac381ff0420b0063f8a7e7dfe2e
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py
@@ -0,0 +1,20 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.5, momentum=0.9, weight_decay=0.00004),
+ paramwise_cfg=dict(norm_decay_mult=0),
+)
+
+# learning policy
+param_scheduler = [
+ dict(type='ConstantLR', factor=0.1, by_epoch=False, begin=0, end=5000),
+ dict(type='PolyLR', eta_min=0, by_epoch=False, begin=5000)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/_base_/schedules/imagenet_bs2048.py b/configs/_base_/schedules/imagenet_bs2048.py
new file mode 100644
index 0000000000000000000000000000000000000000..1cfbfbe6752d923c248b92f3c7b7ace817bad9a4
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs2048.py
@@ -0,0 +1,21 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(
+ type='SGD', lr=0.8, momentum=0.9, weight_decay=0.0001, nesterov=True))
+
+# learning policy
+param_scheduler = [
+ dict(
+ type='LinearLR', start_factor=0.25, by_epoch=False, begin=0, end=2500),
+ dict(
+ type='MultiStepLR', by_epoch=True, milestones=[30, 60, 90], gamma=0.1)
+]
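+# NOTE: the 2500-iteration warmup corresponds to roughly 4 epochs on
+# ImageNet-1k at a total batch size of 2048 (about 625 iterations per epoch).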
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/_base_/schedules/imagenet_bs2048_AdamW.py b/configs/_base_/schedules/imagenet_bs2048_AdamW.py
new file mode 100644
index 0000000000000000000000000000000000000000..bbfae8ef222b10663e1313000d05290d729ca5c8
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs2048_AdamW.py
@@ -0,0 +1,39 @@
+# optimizer
+# In ClassyVision, the lr is set to 0.003 for bs4096.
+# In this implementation (bs2048), lr = 0.003 / 4096 * (32 images/GPU * 64 GPUs) = 0.0015
+optim_wrapper = dict(
+ optimizer=dict(type='AdamW', lr=0.0015, weight_decay=0.3),
+ # specific to vit pretrain
+ paramwise_cfg=dict(custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+)
+
+# learning policy
+warmup_epochs = 15 # about 10000 iterations for ImageNet-1k
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=warmup_epochs,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=warmup_epochs)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/_base_/schedules/imagenet_bs2048_adamw_levit.py b/configs/_base_/schedules/imagenet_bs2048_adamw_levit.py
new file mode 100644
index 0000000000000000000000000000000000000000..25a536eaac52f1c42b37e0d0b102da252deebd67
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs2048_adamw_levit.py
@@ -0,0 +1,40 @@
+# with a batch size of 256 per GPU on 8 GPUs:
+# lr = 5e-4 * 256 * 8 / 512 = 0.002
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+ lr=0.002,
+ weight_decay=0.025,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.attention_biases': dict(decay_mult=0.0),
+ }),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6 / 0.002,
+ by_epoch=True,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True,
+ ),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=5)
+]
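+# NOTE: the warmup starts from an lr of 1e-6 (start_factor * base lr of 0.002)
+# and ramps up linearly over the first 5 epochs.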
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=1000)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/_base_/schedules/imagenet_bs2048_coslr.py b/configs/_base_/schedules/imagenet_bs2048_coslr.py
new file mode 100644
index 0000000000000000000000000000000000000000..b8551f55c8082ba07c084324c2bf1fbb9f26ea56
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs2048_coslr.py
@@ -0,0 +1,35 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(
+ type='SGD', lr=0.8, momentum=0.9, weight_decay=0.0001, nesterov=True))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.25,
+ by_epoch=True,
+ begin=0,
+ # about 2500 iterations for ImageNet-1k
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ )
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/_base_/schedules/imagenet_bs2048_rsb.py b/configs/_base_/schedules/imagenet_bs2048_rsb.py
new file mode 100644
index 0000000000000000000000000000000000000000..f0d2d7994293afdc43b906c918d486397dc53206
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs2048_rsb.py
@@ -0,0 +1,32 @@
+# optimizer
+optim_wrapper = dict(optimizer=dict(type='Lamb', lr=0.005, weight_decay=0.02))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ eta_min=1.0e-6,
+ by_epoch=True,
+ begin=5,
+ end=100)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/_base_/schedules/imagenet_bs256.py b/configs/_base_/schedules/imagenet_bs256.py
new file mode 100644
index 0000000000000000000000000000000000000000..3f92273d1b831ae5cd6663cfe65b1f0d8f01e630
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs256.py
@@ -0,0 +1,16 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+
+# learning policy
+param_scheduler = dict(
+ type='MultiStepLR', by_epoch=True, milestones=[30, 60, 90], gamma=0.1)
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/_base_/schedules/imagenet_bs256_140e.py b/configs/_base_/schedules/imagenet_bs256_140e.py
new file mode 100644
index 0000000000000000000000000000000000000000..e65bf522d9739073baf38db7f10e6b27d7cd4f31
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs256_140e.py
@@ -0,0 +1,16 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+
+# learning policy
+param_scheduler = dict(
+ type='MultiStepLR', by_epoch=True, milestones=[40, 80, 120], gamma=0.1)
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=140, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/_base_/schedules/imagenet_bs256_200e_coslr_warmup.py b/configs/_base_/schedules/imagenet_bs256_200e_coslr_warmup.py
new file mode 100644
index 0000000000000000000000000000000000000000..c8d94a7606aead6d4142bf8a61228eb6475eb5c6
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs256_200e_coslr_warmup.py
@@ -0,0 +1,34 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.25,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True,
+ ),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=195,
+ by_epoch=True,
+ begin=5,
+ end=200,
+ )
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=200, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/_base_/schedules/imagenet_bs256_coslr.py b/configs/_base_/schedules/imagenet_bs256_coslr.py
new file mode 100644
index 0000000000000000000000000000000000000000..44e2c8bb5d0800568bb3c7079b9e0c3e1322711c
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs256_coslr.py
@@ -0,0 +1,16 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+
+# learning policy
+param_scheduler = dict(
+ type='CosineAnnealingLR', T_max=100, by_epoch=True, begin=0, end=100)
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/_base_/schedules/imagenet_bs256_coslr_coswd_300e.py b/configs/_base_/schedules/imagenet_bs256_coslr_coswd_300e.py
new file mode 100644
index 0000000000000000000000000000000000000000..318e031574367aa9d34ec28453deccc60377372f
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs256_coslr_coswd_300e.py
@@ -0,0 +1,40 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.001,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1.0e-6,
+ by_epoch=True,
+ begin=5,
+ end=300),
+ dict(
+ type='CosineAnnealingParamScheduler',
+ param_name='weight_decay',
+ eta_min=0.00001,
+ by_epoch=True,
+ begin=0,
+ end=300)
+]
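+# NOTE: in addition to the lr, the weight decay is cosine-annealed from the
+# initial 1e-4 down to 1e-5 over the full 300 epochs.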
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/_base_/schedules/imagenet_bs256_epochstep.py b/configs/_base_/schedules/imagenet_bs256_epochstep.py
new file mode 100644
index 0000000000000000000000000000000000000000..b8c2b905bf362022d07d452df76c10cccfb6565e
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs256_epochstep.py
@@ -0,0 +1,15 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.045, momentum=0.9, weight_decay=0.00004))
+
+# learning policy
+param_scheduler = dict(type='StepLR', by_epoch=True, step_size=1, gamma=0.98)
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/_base_/schedules/imagenet_bs4096_AdamW.py b/configs/_base_/schedules/imagenet_bs4096_AdamW.py
new file mode 100644
index 0000000000000000000000000000000000000000..84b1f39beaef86b412c159a54d74c4f09458dc57
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs4096_AdamW.py
@@ -0,0 +1,39 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='AdamW', lr=0.003, weight_decay=0.3),
+ # specific to vit pretrain
+ paramwise_cfg=dict(custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=30,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=270,
+ by_epoch=True,
+ begin=30,
+ end=300,
+ )
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/_base_/schedules/imagenet_lars_coslr_200e.py b/configs/_base_/schedules/imagenet_lars_coslr_200e.py
new file mode 100644
index 0000000000000000000000000000000000000000..baba55c4f43b60620a646c812b24e6ffcbd7860a
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_lars_coslr_200e.py
@@ -0,0 +1,20 @@
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=4.8, weight_decay=1e-6, momentum=0.9))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR', T_max=190, by_epoch=True, begin=10, end=200)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=200)
diff --git a/configs/_base_/schedules/imagenet_lars_coslr_90e.py b/configs/_base_/schedules/imagenet_lars_coslr_90e.py
new file mode 100644
index 0000000000000000000000000000000000000000..6e7875a36e76eccefbf752d704fcb12beb6c6506
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_lars_coslr_90e.py
@@ -0,0 +1,14 @@
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=0.))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=90, by_epoch=True, begin=0, end=90)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=90)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/_base_/schedules/imagenet_sgd_coslr_100e.py b/configs/_base_/schedules/imagenet_sgd_coslr_100e.py
new file mode 100644
index 0000000000000000000000000000000000000000..08e9a3e71fc0d8c186b8fdeb5bb59fd3a1d5148e
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_sgd_coslr_100e.py
@@ -0,0 +1,14 @@
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=0.3, momentum=0.9, weight_decay=1e-6))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=100, by_epoch=True, begin=0, end=100)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/_base_/schedules/imagenet_sgd_coslr_200e.py b/configs/_base_/schedules/imagenet_sgd_coslr_200e.py
new file mode 100644
index 0000000000000000000000000000000000000000..f38e4983038031c9178813297dc744195e855680
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_sgd_coslr_200e.py
@@ -0,0 +1,12 @@
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=0.03, weight_decay=1e-4, momentum=0.9))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=200, by_epoch=True, begin=0, end=200)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=200)
diff --git a/configs/_base_/schedules/imagenet_sgd_steplr_100e.py b/configs/_base_/schedules/imagenet_sgd_steplr_100e.py
new file mode 100644
index 0000000000000000000000000000000000000000..75b725c7dfb074c3ebe5c7536752eb32c45b89cc
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_sgd_steplr_100e.py
@@ -0,0 +1,14 @@
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=1e-4))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='MultiStepLR', by_epoch=True, milestones=[60, 80], gamma=0.1)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/arcface/README.md b/configs/arcface/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6b2ee6a3e6164531da343e954c9f5a20917f052d
--- /dev/null
+++ b/configs/arcface/README.md
@@ -0,0 +1,80 @@
+# ArcFace
+
+> [ArcFace: Additive Angular Margin Loss for Deep Face Recognition](https://arxiv.org/abs/1801.07698)
+
+
+
+## Abstract
+
+Recently, a popular line of research in face recognition is adopting margins in the well-established softmax loss function to maximize class separability. In this paper, we first introduce an Additive Angular Margin Loss (ArcFace), which not only has a clear geometric interpretation but also significantly enhances the discriminative power. Since ArcFace is susceptible to the massive label noise, we further propose sub-center ArcFace, in which each class contains K sub-centers and training samples only need to be close to any of the K positive sub-centers. Sub-center ArcFace encourages one dominant sub-class that contains the majority of clean faces and non-dominant sub-classes that include hard or noisy faces. Based on this self-propelled isolation, we boost the performance through automatically purifying raw web faces under massive real-world noise. Besides discriminative feature embedding, we also explore the inverse problem, mapping feature vectors to face images. Without training any additional generator or discriminator, the pre-trained ArcFace model can generate identity-preserved face images for both subjects inside and outside the training data only by using the network gradient and Batch Normalization (BN) priors. Extensive experiments demonstrate that ArcFace can enhance the discriminative feature embedding as well as strengthen the generative face synthesis.
+
+
+

+
+
+## How to use it?
+
+
+
+**Retrieve image**
+
+```python
+from mmpretrain import ImageRetrievalInferencer
+
+inferencer = ImageRetrievalInferencer('resnet50-arcface_inshop', prototype='demo/')
+predict = inferencer('demo/dog.jpg', topk=2)[0]
+print(predict[0])
+print(predict[1])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('resnet50-arcface_inshop', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
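+
+As a minimal sketch (not an official example, and assuming `extract_feat` returns the pooled backbone features, possibly wrapped in a tuple, as in the snippet above), the extracted embeddings can be compared with cosine similarity:
+
+```python
+import torch
+import torch.nn.functional as F
+from mmpretrain import get_model
+
+model = get_model('resnet50-arcface_inshop', pretrained=True)
+img_a = torch.rand(1, 3, 224, 224)
+img_b = torch.rand(1, 3, 224, 224)
+feat_a = model.extract_feat(img_a)
+feat_b = model.extract_feat(img_b)
+# extract_feat may return a tuple of per-stage features; keep the last one.
+if isinstance(feat_a, (tuple, list)):
+    feat_a, feat_b = feat_a[-1], feat_b[-1]
+print(F.cosine_similarity(feat_a, feat_b).item())
+```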
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/arcface/resnet50-arcface_8xb32_inshop.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/arcface/resnet50-arcface_8xb32_inshop.py https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth
+```
+
+
+
+## Models and results
+
+### Image Retrieval on InShop
+
+| Model | Pretrain | Params(M) | Flops(G) | Recall@1 | mAP@10 | Config | Download |
+| :-----------------------: | :------------------------------------------------: | :-------: | :------: | :------: | :----: | :------------------------------------------: | :------------------------------------------------: |
+| `resnet50-arcface_inshop` | [ImageNet-21k-mill](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth) | 31.69 | 16.48 | 90.18 | 69.30 | [config](./resnet50-arcface_8xb32_inshop.py) | [model](https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.log) |
+
+## Citation
+
+```bibtex
+@inproceedings{deng2018arcface,
+ title={ArcFace: Additive Angular Margin Loss for Deep Face Recognition},
+ author={Deng, Jiankang and Guo, Jia and Niannan, Xue and Zafeiriou, Stefanos},
+ booktitle={CVPR},
+ year={2019}
+}
+```
diff --git a/configs/arcface/metafile.yml b/configs/arcface/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..050aba5b3e1c2980234aef13767106ed237eee12
--- /dev/null
+++ b/configs/arcface/metafile.yml
@@ -0,0 +1,28 @@
+Collections:
+ - Name: ArcFace
+ Metadata:
+ Training Data: InShop
+ Architecture:
+ - Additive Angular Margin Loss
+ Paper:
+ URL: https://arxiv.org/abs/1801.07698
+ Title: 'ArcFace: Additive Angular Margin Loss for Deep Face Recognition'
+ README: configs/arcface/README.md
+ Code:
+ Version: v1.0.0rc3
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc3/mmcls/models/heads/margin_head.py
+
+Models:
+ - Name: resnet50-arcface_inshop
+ Metadata:
+ FLOPs: 16571226112
+ Parameters: 31693888
+ In Collection: ArcFace
+ Results:
+ - Dataset: InShop
+ Metrics:
+ Recall@1: 90.18
+ mAP@10: 69.30
+ Task: Image Retrieval
+ Weights: https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth
+ Config: configs/arcface/resnet50-arcface_8xb32_inshop.py
diff --git a/configs/arcface/resnet50-arcface_8xb32_inshop.py b/configs/arcface/resnet50-arcface_8xb32_inshop.py
new file mode 100644
index 0000000000000000000000000000000000000000..cc351e7870415a687679a1970bba0c24ebc02884
--- /dev/null
+++ b/configs/arcface/resnet50-arcface_8xb32_inshop.py
@@ -0,0 +1,71 @@
+_base_ = [
+ '../_base_/datasets/inshop_bs32_448.py',
+ '../_base_/schedules/cub_bs64.py',
+ '../_base_/default_runtime.py',
+]
+
+pretrained = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth' # noqa
+model = dict(
+ type='ImageToImageRetriever',
+ image_encoder=[
+ dict(
+ type='ResNet',
+ depth=50,
+ init_cfg=dict(
+ type='Pretrained', checkpoint=pretrained, prefix='backbone')),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='ArcFaceClsHead',
+ num_classes=3997,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=None),
+ prototype={{_base_.gallery_dataloader}})
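+# The `{{_base_.gallery_dataloader}}` reference above pulls the gallery
+# dataloader defined in the base InShop dataset config; the retriever builds
+# its prototype (gallery features) from it before evaluation, see the
+# `PrepareProtoBeforeValLoopHook` registered at the bottom of this config.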
+
+# runtime settings
+default_hooks = dict(
+ # log every 20 iterations
+ logger=dict(type='LoggerHook', interval=20),
+ # save last three checkpoints
+ checkpoint=dict(
+ type='CheckpointHook',
+ save_best='auto',
+ interval=1,
+ max_keep_ckpts=3,
+ rule='greater'))
+
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(
+ type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0005, nesterov=True))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.01,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=45,
+ by_epoch=True,
+ begin=5,
+ end=50,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=50, val_interval=1)
+
+auto_scale_lr = dict(enable=True, base_batch_size=256)
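+# NOTE: `enable=True` above turns the automatic scaling on, so the lr of 0.02
+# (set for 8 GPUs x 32 images = 256) is rescaled linearly whenever the actual
+# total batch size differs from `base_batch_size`.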
+
+custom_hooks = [
+ dict(type='PrepareProtoBeforeValLoopHook'),
+ dict(type='SyncBuffersHook')
+]
diff --git a/configs/barlowtwins/README.md b/configs/barlowtwins/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..515d138856b170378ecfeb213aff6c582442f335
--- /dev/null
+++ b/configs/barlowtwins/README.md
@@ -0,0 +1,85 @@
+# BarlowTwins
+
+> [Barlow Twins: Self-Supervised Learning via Redundancy Reduction](https://arxiv.org/abs/2103.03230)
+
+
+
+## Abstract
+
+Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('barlowtwins_resnet50_8xb256-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :-------------------------------------------- | :--------: | :-------: | :------------------------------------------------------: | :------------------------------------------------------------------------------: |
+| `barlowtwins_resnet50_8xb256-coslr-300e_in1k` | 174.54 | 4.11 | [config](barlowtwins_resnet50_8xb256-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k` | [BARLOWTWINS](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth) | 25.56 | 4.11 | 71.80 | [config](benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{zbontar2021barlow,
+ title={Barlow twins: Self-supervised learning via redundancy reduction},
+ author={Zbontar, Jure and Jing, Li and Misra, Ishan and LeCun, Yann and Deny, St{\'e}phane},
+ booktitle={International Conference on Machine Learning},
+ year={2021},
+}
+```
diff --git a/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-1000e_in1k.py b/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-1000e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f12dd2e1460094e98cbc14f8bb81f67a95cb161d
--- /dev/null
+++ b/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-1000e_in1k.py
@@ -0,0 +1,70 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_byol.py',
+ '../_base_/default_runtime.py',
+]
+# datasets
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='BarlowTwins',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=8192,
+ out_channels=8192,
+ num_layers=3,
+ with_last_bn=False,
+ with_last_bn_affine=False,
+ with_avg_pool=True,
+ init_cfg=dict(
+ type='Kaiming', distribution='uniform', layer=['Linear'])),
+ head=dict(
+ type='LatentCrossCorrelationHead',
+ in_channels=8192,
+ loss=dict(type='CrossCorrelationLoss')))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=1e-6),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(
+ decay_mult=0, lr_mult=0.024, lars_exclude=True),
+ }))
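+# NOTE: BN parameters, biases and the BN in the downsample branch use a reduced
+# lr (1.6 * 0.024 = 0.0384), no weight decay, and are excluded from the LARS
+# trust-ratio adaptation via `lars_exclude=True`.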
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.6e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=990,
+ eta_min=0.0016,
+ by_epoch=True,
+ begin=10,
+ end=1000,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1000)
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py b/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..74a7f2b9bb09a3d2cb0da644935c5f2d181bd5f4
--- /dev/null
+++ b/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py
@@ -0,0 +1,70 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_byol.py',
+ '../_base_/default_runtime.py',
+]
+# datasets
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='BarlowTwins',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=8192,
+ out_channels=8192,
+ num_layers=3,
+ with_last_bn=False,
+ with_last_bn_affine=False,
+ with_avg_pool=True,
+ init_cfg=dict(
+ type='Kaiming', distribution='uniform', layer=['Linear'])),
+ head=dict(
+ type='LatentCrossCorrelationHead',
+ in_channels=8192,
+ loss=dict(type='CrossCorrelationLoss')))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=1e-6),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(
+ decay_mult=0, lr_mult=0.024, lars_exclude=True),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.6e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=290,
+ eta_min=0.0016,
+ by_epoch=True,
+ begin=10,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py b/configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2f4e4f574ffd130abff07f9b1e2ec22b80fbbaba
--- /dev/null
+++ b/configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_sgd_coslr_100e.py',
+ '../../_base_/default_runtime.py',
+]
+
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
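+# NOTE: the empty `checkpoint` above is a placeholder; point it to the
+# pretrained BarlowTwins weights before running this linear-probe benchmark,
+# e.g. via `--cfg-options model.backbone.init_cfg.checkpoint=<path>`.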
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/barlowtwins/metafile.yml b/configs/barlowtwins/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..705080e09af9c59ecc88737073deed6de170664c
--- /dev/null
+++ b/configs/barlowtwins/metafile.yml
@@ -0,0 +1,44 @@
+Collections:
+ - Name: BarlowTwins
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - LARS
+ Training Resources: 8x A100 GPUs
+ Architecture:
+ - ResNet
+ - BarlowTwins
+ Paper:
+ Title: 'Barlow Twins: Self-Supervised Learning via Redundancy Reduction'
+ URL: https://arxiv.org/abs/2103.03230
+ README: configs/barlowtwins/README.md
+
+Models:
+ - Name: barlowtwins_resnet50_8xb256-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 2048
+ FLOPs: 4109364224
+ Parameters: 174535744
+ Training Data: ImageNet-1k
+ In Collection: BarlowTwins
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth
+ Config: configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py
+ Downstream:
+ - resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k
+ - Name: resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 256
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: BarlowTwins
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 71.8
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth
+ Config: configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py
diff --git a/configs/beit/README.md b/configs/beit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..404e6524a4db0e73daffd277386131717bd4106d
--- /dev/null
+++ b/configs/beit/README.md
@@ -0,0 +1,88 @@
+# BEiT
+
+> [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254)
+
+
+
+## Abstract
+
+We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('beit-base-p16_beit-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('beit_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :---------------------------------------------- | :--------: | :-------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------: |
+| `beit_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 86.53 | 17.58 | [config](beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: |
+| `beit-base-p16_beit-pre_8xb128-coslr-100e_in1k` | [BEIT](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth) | 86.53 | 17.58 | 83.10 | N/A | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.json) |
+| `beit-base-p16_beit-in21k-pre_3rdparty_in1k`\* | BEIT ImageNet-21k | 86.53 | 17.58 | 85.28 | 97.59 | [config](benchmarks/beit-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{bao2022beit,
+ title={{BE}iT: {BERT} Pre-Training of Image Transformers},
+ author={Hangbo Bao and Li Dong and Songhao Piao and Furu Wei},
+ booktitle={International Conference on Learning Representations},
+ year={2022},
+}
+```
diff --git a/configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5786f79ef207f1e54b9ded1903c6b3a7b632b4f3
--- /dev/null
+++ b/configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,130 @@
+_base_ = '../_base_/default_runtime.py'
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='TwoNormDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ second_mean=[-31.875, -31.875, -31.875],
+ second_std=[318.75, 318.75, 318.75],
+ to_rgb=True)
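+# The second mean/std normalize the extra 112x112 view for the DALL-E target
+# generator: (x + 31.875) / 318.75 maps pixel value 0 to 0.1 and 255 to 0.9.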
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.4,
+ hue=0.),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandomResizedCropAndInterpolationWithTwoPic',
+ size=224,
+ second_size=112,
+ interpolation='bicubic',
+ second_interpolation='lanczos',
+ scale=(0.08, 1.0)),
+ dict(
+ type='BEiTMaskGenerator',
+ input_size=(14, 14),
+ num_masking_patches=75,
+ max_num_patches=None,
+ min_num_patches=16),
+ dict(type='PackInputs')
+]
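+# NOTE: with a 14x14 patch grid (196 patches in total), masking 75 patches
+# corresponds to a masking ratio of roughly 38%.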
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
+
+# model settings
+model = dict(
+ type='BEiT',
+ backbone=dict(
+ type='BEiTPretrainViT',
+ arch='base',
+ patch_size=16,
+ drop_path_rate=0.1,
+ final_norm=True,
+ out_type='raw',
+ layer_scale_init_value=0.1,
+ init_cfg=[
+ dict(type='TruncNormal', std=0.02, layer='Linear'),
+ dict(type='TruncNormal', std=0.02, layer='Conv2d'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ neck=None,
+ head=dict(
+ type='BEiTV1Head',
+ embed_dims=768,
+ num_embed=8192,
+ loss=dict(type='CrossEntropyLoss')),
+ target_generator=dict(
+ type='DALL-E',
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint= # noqa: E251
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/dalle_encoder.pth', # noqa: E501
+ )))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05),
+ clip_grad=dict(max_norm=3.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ # the following configurations are designed for BEiT
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ 'q_bias': dict(decay_mult=0.0),
+ 'v_bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py b/configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..dbab34f6e084f5c9959cfb233174a0dc059e0930
--- /dev/null
+++ b/configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,127 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+data_preprocessor = dict(
+ num_classes=1000,
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ to_rgb=True,
+)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ use_abs_pos_emb=False,
+ use_rel_pos_bias=True,
+ use_shared_rel_pos_bias=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ _delete_=True,
+ layer_decay_rate=0.65,
+ custom_keys={
+ # the following configurations are designed for BEiT
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ 'q_bias': dict(decay_mult=0.0),
+ 'v_bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
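+# NOTE: the layer-wise decay constructor scales each transformer layer's lr by
+# a power of `layer_decay_rate`, so earlier layers are fine-tuned with smaller
+# learning rates than the head.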
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=20,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+randomness = dict(seed=0)
diff --git a/configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py b/configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8380b69afc061d1934fae3eba57b7f352a508b1e
--- /dev/null
+++ b/configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ out_type='avg_featmap',
+ use_abs_pos_emb=False,
+ use_rel_pos_bias=True,
+ use_shared_rel_pos_bias=False,
+ ),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/beit/metafile.yml b/configs/beit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..e4524faec783292836fcb2520e9cff5c2262e93d
--- /dev/null
+++ b/configs/beit/metafile.yml
@@ -0,0 +1,69 @@
+Collections:
+ - Name: BEiT
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ Title: 'BEiT: BERT Pre-Training of Image Transformers'
+ URL: https://arxiv.org/abs/2106.08254
+ README: configs/beit/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/beit.py
+ Version: v1.0.0rc4
+
+Models:
+ - Name: beit_beit-base-p16_8xb256-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 2048
+ FLOPs: 17581219584
+ Parameters: 86530984
+ Training Data: ImageNet-1k
+ In Collection: BEiT
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth
+ Config: configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+ Downstream:
+ - beit-base-p16_beit-pre_8xb128-coslr-100e_in1k
+ - Name: beit-base-p16_beit-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581219584
+ Parameters: 86530984
+ Training Data: ImageNet-1k
+ In Collection: BEiT
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.1
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth
+ Config: configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: beit-base-p16_beit-in21k-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 17581219584
+ Parameters: 86530984
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: BEiT
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.28
+ Top 5 Accuracy: 97.59
+ Weights: https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth
+ Config: configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py
+ Converted From:
+ Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beit/beit_base_patch16_224_pt22k_ft22kto1k.pth
+ Code: https://github.com/microsoft/unilm/tree/master/beit
diff --git a/configs/beitv2/README.md b/configs/beitv2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..5447e2d3a36e1d1e0f3d6800c4cc2e2380fdc012
--- /dev/null
+++ b/configs/beitv2/README.md
@@ -0,0 +1,90 @@
+# BEiTv2
+
+> [BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers](https://arxiv.org/abs/2208.06366)
+
+
+
+## Abstract
+
+Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this work, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Furthermore, we introduce a patch aggregation strategy which associates discrete image patches to enhance global semantic representation. Experiments on image classification and semantic segmentation show that BEiT v2 outperforms all compared MIM methods. On ImageNet-1K (224 size), the base-size BEiT v2 achieves 85.5% top-1 accuracy for fine-tuning and 80.1% top-1 accuracy for linear probing. The large-size BEiT v2 obtains 87.3% top-1 accuracy for ImageNet-1K (224 size) fine-tuning, and 56.7% mIoU on ADE20K for semantic segmentation.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth
+```
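+
+Multi-GPU testing usually goes through the standard OpenMMLab launcher. The sketch below assumes the usual `tools/dist_test.sh` wrapper and 8 GPUs; adjust the script path and GPU count to your setup.
+
+```shell
+bash tools/dist_test.sh \
+    configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py \
+    https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth \
+    8
+```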
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------------------------ | :--------: | :-------: | :----------------------------------------------------------: | :----------------------------------------------------------------------: |
+| `beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 192.81 | 17.58 | [config](beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: |
+| `beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k` | [BEITV2](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth) | 86.53 | 17.58 | 85.00 | N/A | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.json) |
+| `beit-base-p16_beitv2-in21k-pre_3rdparty_in1k`\* | BEITV2 ImageNet-21k | 86.53 | 17.58 | 86.47 | 97.99 | [config](benchmarks/beit-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit2). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{beitv2,
+ title={{BEiT v2}: Masked Image Modeling with Vector-Quantized Visual Tokenizers},
+ author={Zhiliang Peng and Li Dong and Hangbo Bao and Qixiang Ye and Furu Wei},
+ year={2022},
+ eprint={2208.06366},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-1600e_in1k.py b/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c4a2070b5de3ebbe93ed0b0658ee9157a6b62136
--- /dev/null
+++ b/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-1600e_in1k.py
@@ -0,0 +1,119 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_beitv2.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+vqkd_encoder = dict(
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ in_channels=3,
+ out_indices=-1,
+ drop_rate=0.,
+ drop_path_rate=0.,
+ norm_cfg=dict(type='LN', eps=1e-6),
+ final_norm=True,
+ out_type='featmap',
+ with_cls_token=True,
+ frozen_stages=-1,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ layer_scale_init_value=0.,
+ interpolate_mode='bicubic',
+ patch_cfg=dict(),
+ layer_cfgs=dict(),
+ init_cfg=None)
+
+layer_scale_init_value = 0.1
+drop_path_rate = 0.1 # 0. for 300 epochs and 0.1 for 1600 epochs.
+model = dict(
+ type='BEiT',
+ backbone=dict(
+ type='BEiTPretrainViT',
+ arch='base',
+ patch_size=16,
+ out_indices=[-4, -1],
+ drop_path_rate=drop_path_rate,
+ final_norm=False,
+ out_type='raw',
+ layer_scale_init_value=layer_scale_init_value,
+ init_cfg=[
+ dict(type='TruncNormal', std=0.02, layer='Linear'),
+ dict(type='TruncNormal', std=0.02, layer='Conv2d'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ neck=dict(
+ type='BEiTV2Neck',
+ num_layers=2,
+ early_layers=9,
+ backbone_arch='base',
+ drop_path_rate=drop_path_rate,
+ layer_scale_init_value=layer_scale_init_value,
+ ),
+ head=dict(
+ type='BEiTV2Head',
+ embed_dims=768,
+ num_embed=8192,
+ loss=dict(type='CrossEntropyLoss')),
+ target_generator=dict(
+ type='VQKD',
+ encoder_config=vqkd_encoder,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint= # noqa
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/vqkd_encoder.pth' # noqa
+ )))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ # betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs.
+ optimizer=dict(
+ type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05),
+ clip_grad=dict(max_norm=3.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ # the following configurations are designed for BEiT
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ 'q_bias': dict(decay_mult=0.0),
+ 'v_bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fddeccff1998fa850097ca4ae07b6fe874476dd0
--- /dev/null
+++ b/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,119 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_beitv2.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+vqkd_encoder = dict(
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ in_channels=3,
+ out_indices=-1,
+ drop_rate=0.,
+ drop_path_rate=0.,
+ norm_cfg=dict(type='LN', eps=1e-6),
+ final_norm=True,
+ out_type='featmap',
+ with_cls_token=True,
+ frozen_stages=-1,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ layer_scale_init_value=0.,
+ interpolate_mode='bicubic',
+ patch_cfg=dict(),
+ layer_cfgs=dict(),
+ init_cfg=None)
+
+layer_scale_init_value = 0.1
+drop_path_rate = 0. # 0. for 300 epochs and 0.1 for 1600 epochs.
+model = dict(
+ type='BEiT',
+ backbone=dict(
+ type='BEiTPretrainViT',
+ arch='base',
+ patch_size=16,
+ out_indices=[-4, -1],
+ drop_path_rate=drop_path_rate,
+ final_norm=False,
+ out_type='raw',
+ layer_scale_init_value=layer_scale_init_value,
+ init_cfg=[
+ dict(type='TruncNormal', std=0.02, layer='Linear'),
+ dict(type='TruncNormal', std=0.02, layer='Conv2d'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ neck=dict(
+ type='BEiTV2Neck',
+ num_layers=2,
+ early_layers=9,
+ backbone_arch='base',
+ drop_path_rate=drop_path_rate,
+ layer_scale_init_value=layer_scale_init_value,
+ ),
+ head=dict(
+ type='BEiTV2Head',
+ embed_dims=768,
+ num_embed=8192,
+ loss=dict(type='CrossEntropyLoss')),
+ target_generator=dict(
+ type='VQKD',
+ encoder_config=vqkd_encoder,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint= # noqa
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/vqkd_encoder.pth' # noqa
+ )))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ # betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs.
+ optimizer=dict(
+ type='AdamW', lr=1.5e-3, betas=(0.9, 0.98), weight_decay=0.05),
+ clip_grad=dict(max_norm=3.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ # the following configurations are designed for BEiT
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ 'q_bias': dict(decay_mult=0.0),
+ 'v_bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py b/configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a2c55a706b351d5c8bd7981aaa324877cb440b11
--- /dev/null
+++ b/configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,122 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ # drop path rate: 0.2 for models pre-trained for 1600 epochs, 0.1 for 300 epochs.
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ use_abs_pos_emb=False,
+ use_rel_pos_bias=True,
+ use_shared_rel_pos_bias=False,
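+ # The pretrained checkpoint path is left empty here; fill it in or override
+ # it at launch time, e.g. with
+ # `--cfg-options model.backbone.init_cfg.checkpoint=CKPT_PATH` (assuming the
+ # standard MMEngine-style train/test entry points).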
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=5e-4, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ _delete_=True,
+ # layer-wise decay rate: 0.6 for models pre-trained for 1600 epochs, 0.65 for 300 epochs.
+ layer_decay_rate=0.65,
+ custom_keys={
+ # the following configurations are designed for BEiT
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ 'q_bias': dict(decay_mult=0.0),
+ 'v_bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=20,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+randomness = dict(seed=0)
diff --git a/configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py b/configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..17ed4ff3d2cf40f8d819add1b3aa4f668a41128a
--- /dev/null
+++ b/configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ out_type='avg_featmap',
+ use_abs_pos_emb=False,
+ use_rel_pos_bias=True,
+ use_shared_rel_pos_bias=False,
+ ),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/beitv2/metafile.yml b/configs/beitv2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..74c3885e11cd8140cea7aac40973ade4ce4e7e64
--- /dev/null
+++ b/configs/beitv2/metafile.yml
@@ -0,0 +1,69 @@
+Collections:
+ - Name: BEiTv2
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ Title: 'BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers'
+ URL: https://arxiv.org/abs/2208.06366
+ README: configs/beitv2/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/beit.py
+ Version: v1.0.0rc4
+
+Models:
+ - Name: beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 2048
+ FLOPs: 17581223424
+ Parameters: 192811376
+ Training Data: ImageNet-1k
+ In Collection: BEiTv2
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth
+ Config: configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+ Downstream:
+ - beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k
+ - Name: beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581219584
+ Parameters: 86530984
+ Training Data: ImageNet-1k
+ In Collection: BEiTv2
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.0
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth
+ Config: configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: beit-base-p16_beitv2-in21k-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 17581219584
+ Parameters: 86530984
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: BEiTv2
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 86.47
+ Top 5 Accuracy: 97.99
+ Weights: https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth
+ Config: configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py
+ Converted From:
+ Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beitv2/beitv2_base_patch16_224_pt1k_ft21kto1k.pth
+ Code: https://github.com/microsoft/unilm/tree/master/beit2
diff --git a/configs/blip/README.md b/configs/blip/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1a8dce392cb3ec3ab36eed8ab9b3af90ee0f1219
--- /dev/null
+++ b/configs/blip/README.md
@@ -0,0 +1,128 @@
+# BLIP
+
+> [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086)
+
+
+
+## Abstract
+
+Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+from mmpretrain import inference_model
+
+result = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png')
+print(result)
+# {'pred_caption': 'a puppy and a cat sitting on a blanket'}
+```
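+
+The checkpoints listed in the tables below cover several other tasks as well. As a rough sketch (the exact argument handling is delegated to the task-specific inferencer in `mmpretrain`, so treat the call signature as an assumption), visual question answering can be queried by passing the question together with the image:
+
+```python
+from mmpretrain import inference_model
+
+# Hypothetical sketch: the question is passed as an extra positional argument
+# and the result is expected to contain a `pred_answer` field.
+result = inference_model('blip-base_3rdparty_vqa', 'demo/cat-dog.png', 'What animals are in the picture?')
+print(result)
+```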
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/blip/blip-base_8xb32_caption.py https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth
+```
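+
+The same pattern applies to the other configs listed below. For example, evaluating the converted VQA checkpoint would look like the command below, assuming the datasets required by `coco_vg_vqa.py` have already been prepared:
+
+```shell
+python tools/test.py configs/blip/blip-base_8xb32_vqa.py https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth
+```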
+
+
+
+## Models and results
+
+### Image Caption on COCO
+
+| Model | Params (M) | BLEU-4 | CIDER | Config | Download |
+| :----------------------------- | :--------: | :----: | :----: | :------------------------------------: | :------------------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_caption`\* | 223.97 | 40.12 | 132.82 | [config](./blip-base_8xb32_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth) |
+
+### Image Caption on NoCaps
+
+| Model | Params (M) | SPICE | CIDER | Config | Download |
+| :----------------------------- | :--------: | :---: | :----: | :-----------------------------------: | :--------------------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_caption`\* | 223.97 | 14.69 | 109.12 | [config](./blip-base_8xb32_nocaps.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth) |
+
+### Image Caption on Flickr30k
+
+| Model | Params (M) | SPICE | CIDER | Config | Download |
+| :----------------------------- | :--------: | :---: | :---: | :----------------------------------------------: | :----------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_caption`\* | 223.97 | 15.58 | 68.89 | [config](./blip-base_8xb32_caption_flickr30k.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth) |
+
+### Visual Grounding on RefCOCO
+
+| Model | Params (M) | Accuracy (testA) | Accuracy (testB) | Config | Download |
+| :------------------------ | :--------: | :--------------: | :--------------: | :----------------------------------: | :-----------------------------------------------------------------------------------------------: |
+| `blip-base_8xb16_refcoco` | 498.49 | 86.14 | 77.33 | [config](blip-base_8xb16_refcoco.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_8xb16_refcoco_20230508-d2d10f4c.pth) \| [log](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_8xb16_refcoco_20230508-d2d10f4c.json) |
+
+### Visual Question Answering on VQAv2
+
+| Model | Params (M) | Accuracy | Config | Download |
+| :------------------------- | :--------: | :------: | :--------------------------------: | :-------------------------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_vqa`\* | 361.48 | 78.20 | [config](./blip-base_8xb32_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth) |
+
+### Visual Question Answering on OK-VQA
+
+| Model | Params (M) | Accuracy | Config | Download |
+| :------------------------- | :--------: | :------: | :----------------------------------: | :-------------------------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_vqa`\* | 361.48 | 40.59# | [config](./blip-base_8xb32_okvqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth) |
+
+### Visual Question Answering on OCR-VQA
+
+| Model | Params (M) | Accuracy | Config | Download |
+| :------------------------- | :--------: | :------: | :-----------------------------------: | :-------------------------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_vqa`\* | 361.48 | 28.30# | [config](./blip-base_8xb32_ocrvqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth) |
+
+### Image-To-Text Retrieval on COCO
+
+| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
+| :------------------------------- | :--------: | :------: | :------: | :--------------------------------------: | :----------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_retrieval`\* | 447.49 | 82.52 | 95.34 | [config](./blip-base_8xb32_retrieval.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) |
+
+### Text-To-Image Retrieval on COCO
+
+| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
+| :------------------------------- | :--------: | :------: | :------: | :--------------------------------------: | :----------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_retrieval`\* | 447.49 | 64.82 | 86.28 | [config](./blip-base_8xb32_retrieval.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) |
+
+### Image-To-Text Retrieval on Flickr30k
+
+| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
+| :------------------------------- | :--------: | :------: | :------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_retrieval`\* | 447.49 | 95.10# | 99.60# | [config](./blip-base_8xb32_retrieval_flickr30k.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) |
+
+### Text-To-Image Retrieval on Flickr30k
+
+| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
+| :------------------------------- | :--------: | :------: | :------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_retrieval`\* | 447.49 | 85.26# | 96.58# | [config](./blip-base_8xb32_retrieval_flickr30k.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) |
+
+### NLVR on NLVR2
+
+| Model | Params (M) | Top-1 (%) | Config | Download |
+| :-------------------------- | :--------: | :-------: | :---------------------------------: | :------------------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_nlvr`\* | 259.37 | 82.33 | [config](./blip-base_8xb32_nlvr.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_nlvr_20230427-3b14d33f.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/salesforce/LAVIS). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+*Results with # denote zero-shot evaluation. The corresponding model hasn't been finetuned on that dataset.*
+
+## Citation
+
+```bibtex
+@inproceedings{li2022blip,
+ title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
+ author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
+ year={2022},
+ booktitle={ICML},
+}
+```
diff --git a/configs/blip/blip-base_8xb16_refcoco.py b/configs/blip/blip-base_8xb16_refcoco.py
new file mode 100644
index 0000000000000000000000000000000000000000..b4986143a3d6965f7176bbcea445f675cc9a80ec
--- /dev/null
+++ b/configs/blip/blip-base_8xb16_refcoco.py
@@ -0,0 +1,62 @@
+_base_ = [
+ '../_base_/datasets/refcoco.py',
+ '../_base_/default_runtime.py',
+]
+
+med_config = {
+ 'architectures': ['BertModel'],
+ 'attention_probs_dropout_prob': 0.1,
+ 'hidden_act': 'gelu',
+ 'hidden_dropout_prob': 0.1,
+ 'hidden_size': 768,
+ 'initializer_range': 0.02,
+ 'intermediate_size': 3072,
+ 'layer_norm_eps': 1e-12,
+ 'max_position_embeddings': 512,
+ 'model_type': 'bert',
+ 'num_attention_heads': 12,
+ 'num_hidden_layers': 12,
+ 'pad_token_id': 0,
+ 'add_type_embeddings': False,
+ 'vocab_size': 30524,
+ 'encoder_width': 768,
+ 'add_cross_attention': True
+}
+
+model = dict(
+ type='BlipGrounding',
+ visual_encoder=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ text_encoder=dict(
+ type='XBertEncoder',
+ med_config=med_config,
+ ),
+ multimodal_encoder=dict(
+ type='XBertEncoder',
+ med_config=med_config,
+ ),
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ head=dict(
+ type='GroundingHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=med_config,
+ ),
+ box_l1_loss_coeff=4.0,
+ box_giou_loss_coeff=2.0,
+ ),
+)
+
+# schedule settings
+optimizer = dict(type='AdamW', lr=1.5e-5, weight_decay=0.02)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+train_cfg = dict(by_epoch=True, max_epochs=120)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/blip/blip-base_8xb32_caption.py b/configs/blip/blip-base_8xb32_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..1e24e9eababa53b17ac38502ea37eb6a9de40cf5
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_caption.py
@@ -0,0 +1,59 @@
+_base_ = [
+ '../_base_/datasets/coco_caption.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipCaption',
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ decoder_head=dict(
+ type='SeqGenerationHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ ),
+ prompt='a picture of ',
+ max_txt_len=20,
+)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=10,
+ )
+]
+
+train_cfg = dict(max_epochs=10)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/blip/blip-base_8xb32_caption_flickr30k.py b/configs/blip/blip-base_8xb32_caption_flickr30k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9fe6ec561d6b7cd09d2490e8fb50f4f8315a14ba
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_caption_flickr30k.py
@@ -0,0 +1,59 @@
+_base_ = [
+ '../_base_/datasets/flickr30k_caption.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipCaption',
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ decoder_head=dict(
+ type='SeqGenerationHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ ),
+ prompt='a picture of ',
+ max_txt_len=20,
+)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=10,
+ )
+]
+
+train_cfg = dict(max_epochs=10)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/blip/blip-base_8xb32_nlvr.py b/configs/blip/blip-base_8xb32_nlvr.py
new file mode 100644
index 0000000000000000000000000000000000000000..0a6cfe149a07b508830069ba8b8ec4e3ccccc7c0
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_nlvr.py
@@ -0,0 +1,59 @@
+_base_ = [
+ '../_base_/datasets/nlvr2.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipNLVR',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ multimodal_backbone=dict(
+ type='BertModel',
+ config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True,
+ nlvr=True),
+ add_pooling_layer=False),
+)
+
+# optimizer
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.05)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=10,
+ )
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=10)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(logger=dict(interval=1))
diff --git a/configs/blip/blip-base_8xb32_nocaps.py b/configs/blip/blip-base_8xb32_nocaps.py
new file mode 100644
index 0000000000000000000000000000000000000000..c47c56aeec9f6b9f36b35d4ea8c078c06df586ab
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_nocaps.py
@@ -0,0 +1,46 @@
+_base_ = [
+ '../_base_/datasets/nocaps.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipCaption',
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ decoder_head=dict(
+ type='SeqGenerationHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ ),
+ prompt='a picture of ',
+ max_txt_len=20,
+)
+
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/blip/blip-base_8xb32_ocrvqa.py b/configs/blip/blip-base_8xb32_ocrvqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..117d597fcb2d92aab1c0f0bc79aa895a3ab99643
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_ocrvqa.py
@@ -0,0 +1,75 @@
+_base_ = [
+ '../_base_/datasets/ocrvqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipVQA',
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=480,
+ patch_size=16,
+ out_type='raw'),
+ multimodal_backbone=dict(
+ type='XBertEncoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ head=dict(
+ type='VQAGenerationHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ inference_method='generate',
+ ),
+)
+
+# schedule settings
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.05)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+train_cfg = dict(max_epochs=10, by_epoch=True)
+val_cfg = dict()
+test_cfg = dict()
+
+# runtime settings
+randomness = dict(seed=42)
diff --git a/configs/blip/blip-base_8xb32_okvqa.py b/configs/blip/blip-base_8xb32_okvqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..548775c4e0f91128f41701042346b5d4a2567950
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_okvqa.py
@@ -0,0 +1,75 @@
+_base_ = [
+ '../_base_/datasets/coco_okvqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipVQA',
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=480,
+ patch_size=16,
+ out_type='raw'),
+ multimodal_backbone=dict(
+ type='XBertEncoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ head=dict(
+ type='VQAGenerationHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ inference_method='generate',
+ ),
+)
+
+# schedule settings
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.05)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+train_cfg = dict(max_epochs=10, by_epoch=True)
+val_cfg = dict()
+test_cfg = dict()
+
+# runtime settings
+randomness = dict(seed=42)
diff --git a/configs/blip/blip-base_8xb32_retrieval.py b/configs/blip/blip-base_8xb32_retrieval.py
new file mode 100644
index 0000000000000000000000000000000000000000..645f88fd2a8e7ca06c75f603b7ad55539ef60053
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_retrieval.py
@@ -0,0 +1,83 @@
+_base_ = [
+ '../_base_/datasets/coco_retrieval.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipRetrieval',
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ text_backbone=dict(
+ type='XBertEncoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ vision_neck=dict(
+ type='Linear',
+ in_features=768,
+ out_features=256,
+ ),
+ text_neck=dict(
+ type='Linear',
+ in_features=768,
+ out_features=256,
+ ),
+ head=dict(
+ type='ITCHead',
+ embed_dim=256,
+ ),
+ multimodal_head=dict(
+ type='ITMHead',
+ hidden_size=768,
+ with_pooler=False,
+ ),
+ topk=256,
+ max_txt_len=35,
+)
+
+# optimizer
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.04)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+# learning rate scheduler
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=6)
+val_cfg = dict(type='RetrievalValLoop')
+test_cfg = dict(type='RetrievalTestLoop')
+
+randomness = dict(seed=42)
+
+default_hooks = dict(logger=dict(interval=1))
+
+custom_hooks = [
+ dict(
+ type='WarmupParamHook',
+ param_name='alpha',
+ module_name='head',
+ warmup_epochs=2)
+]
diff --git a/configs/blip/blip-base_8xb32_retrieval_flickr30k.py b/configs/blip/blip-base_8xb32_retrieval_flickr30k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0d2e78e943161ec57539096aff5cbc7ae5f29186
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_retrieval_flickr30k.py
@@ -0,0 +1,83 @@
+_base_ = [
+ '../_base_/datasets/flickr30k_retrieval.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipRetrieval',
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ text_backbone=dict(
+ type='XBertEncoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ vision_neck=dict(
+ type='Linear',
+ in_features=768,
+ out_features=256,
+ ),
+ text_neck=dict(
+ type='Linear',
+ in_features=768,
+ out_features=256,
+ ),
+ head=dict(
+ type='ITCHead',
+ embed_dim=256,
+ ),
+ multimodal_head=dict(
+ type='ITMHead',
+ hidden_size=768,
+ with_pooler=False,
+ ),
+ topk=256,
+ max_txt_len=35,
+)
+
+# optimizer
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.04)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+# learning rate scheduler
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=6)
+val_cfg = dict(type='RetrievalValLoop')
+test_cfg = dict(type='RetrievalTestLoop')
+
+randomness = dict(seed=42)
+
+default_hooks = dict(logger=dict(interval=1))
+
+custom_hooks = [
+ dict(
+ type='WarmupParamHook',
+ param_name='alpha',
+ module_name='head',
+ warmup_epochs=2)
+]
diff --git a/configs/blip/blip-base_8xb32_vqa.py b/configs/blip/blip-base_8xb32_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..2aa3f258579617d31b52b6e5a8e7703c56966dd4
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_vqa.py
@@ -0,0 +1,76 @@
+_base_ = [
+ '../_base_/datasets/coco_vg_vqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipVQA',
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=480,
+ patch_size=16,
+ out_type='raw'),
+ multimodal_backbone=dict(
+ type='XBertEncoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ head=dict(
+ type='VQAGenerationHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ inference_method='rank', # or 'generate'
+ answer_list_path=
+ 'https://storage.googleapis.com/sfr-vision-language-research/datasets/answer_list.json', # noqa: E501
+ ),
+)
+
+# schedule settings
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.05)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+train_cfg = dict(max_epochs=10, by_epoch=True)
+test_cfg = dict()
+
+# runtime settings
+randomness = dict(seed=42)
diff --git a/configs/blip/metafile.yml b/configs/blip/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..8877e8192110df35415c875c834fc914bd3a038c
--- /dev/null
+++ b/configs/blip/metafile.yml
@@ -0,0 +1,99 @@
+Collections:
+ - Name: BLIP
+ Metadata:
+ Training Data:
+ - COCO
+ - VG
+ - Conceptual Captions
+ - Conceptual 12M
+ - SBU captions
+ Architecture:
+ - Transformer
+ Training Resources: 8x A100 GPUs
+ Paper:
+ Title: 'BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language
+ Understanding and Generation'
+ URL: https://arxiv.org/abs/2201.12086
+ README: configs/blip/README.md
+
+Models:
+ - Name: blip-base_8xb16_refcoco
+ Metadata:
+ FLOPs: null
+ Parameters: 498488636
+ In Collection: BLIP
+ Results:
+ - Task: Visual Grounding
+ Dataset: RefCOCO
+ Metrics:
+ Accuracy (testA): 86.14
+ Accuracy (testB): 77.33
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_8xb16_refcoco_20230508-d2d10f4c.pth
+ Config: configs/blip/blip-base_8xb16_refcoco.py
+ - Name: blip-base_3rdparty_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 223971644
+ In Collection: BLIP
+ Results:
+ - Dataset: COCO
+ Task: Image Caption
+ Metrics:
+ BLEU-4: 40.12
+ CIDER: 132.82
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth
+ Config: configs/blip/blip-base_8xb32_caption.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP/blip_coco_caption_base.pth
+ Code: https://github.com/salesforce/LAVIS
+ - Name: blip-base_3rdparty_nlvr
+ Metadata:
+ FLOPs: null
+ Parameters: 259372034
+ In Collection: BLIP
+ Results:
+ - Task: NLVR
+ Dataset: NLVR2
+ Metrics:
+ Top 1 Accuracy: 82.33
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_nlvr_20230427-3b14d33f.pth
+ Config: configs/blip/blip-base_8xb32_nlvr.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_nlvr.pth
+ Code: https://github.com/salesforce/LAVIS
+ - Name: blip-base_3rdparty_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 361478972
+ In Collection: BLIP
+ Results:
+ - Task: Visual Question Answering
+ Dataset: VQAv2
+ Metrics:
+ Accuracy: 78.2
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth
+ Config: configs/blip/blip-base_8xb32_vqa.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth
+ Code: https://github.com/salesforce/LAVIS
+ - Name: blip-base_3rdparty_retrieval
+ Metadata:
+ FLOPs: null
+ Parameters: 447486979
+ In Collection: BLIP
+ Results:
+ - Task: Image-To-Text Retrieval
+ Dataset: COCO
+ Metrics:
+ Recall@1: 82.52
+ Recall@5: 95.34
+ - Task: Text-To-Image Retrieval
+ Dataset: COCO
+ Metrics:
+ Recall@1: 64.82
+ Recall@5: 86.28
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth
+ Config: configs/blip/blip-base_8xb32_retrieval.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP/blip_coco_retrieval.pth
+ Code: https://github.com/salesforce/LAVIS
diff --git a/configs/blip2/README.md b/configs/blip2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..68ce679d704dfc23a0afdd7ec2528df9d144547e
--- /dev/null
+++ b/configs/blip2/README.md
@@ -0,0 +1,74 @@
+# BLIP-2
+
+> [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](http://arxiv.org/abs/2301.12597)
+
+
+
+## Abstract
+
+The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pretraining strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pretrained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model’s emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+from mmpretrain import inference_model
+
+result = inference_model('blip2-opt2.7b_3rdparty-zeroshot_caption', 'demo/cat-dog.png')
+print(result)
+# {'pred_caption': 'a dog and a cat sitting on a blanket'}
+```
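+
+The zero-shot VQA variant takes a question alongside the image. The call below is a sketch (the question is assumed to be accepted as the second positional argument, mirroring the caption usage above; check the VQA inferencer for the exact signature):
+
+```python
+from mmpretrain import inference_model
+
+# Hypothetical sketch for zero-shot VQA; the result is expected to contain a
+# `pred_answer` field.
+result = inference_model('blip2-opt2.7b_3rdparty-zeroshot_vqa', 'demo/cat-dog.png', 'What animals are in the picture?')
+print(result)
+```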
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/blip2/blip2_8xb32_retrieval.py https://download.openmmlab.com/mmclassification/v1/blip2/blip2_3rdparty_pretrain_20230505-f7ef4390.pth
+```
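+
+Evaluating the zero-shot VQA entry in the table below follows the same pattern, assuming the VQAv2 data has been prepared:
+
+```shell
+python tools/test.py configs/blip2/blip2-opt2.7b_8xb16_vqa.py https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth
+```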
+
+
+
+## Models and results
+
+### Image Caption on COCO
+
+| Model | Params (M) | BLEU-4 | CIDER | Config | Download |
+| :------------------------------------------ | :--------: | :----: | :----: | :----------------------------------------: | :-------------------------------------------------------------------------------------------: |
+| `blip2-opt2.7b_3rdparty-zeroshot_caption`\* | 3770.47 | 32.90 | 111.10 | [config](./blip2-opt2.7b_8xb32_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth) |
+
+### Visual Question Answering on VQAv2
+
+| Model | Params (M) | Accuracy | Config | Download |
+| :-------------------------------------- | :--------: | :------: | :------------------------------------: | :-------------------------------------------------------------------------------------------------------: |
+| `blip2-opt2.7b_3rdparty-zeroshot_vqa`\* | 3770.47 | 53.50 | [config](./blip2-opt2.7b_8xb16_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth) |
+
+### Image-To-Text Retrieval on COCO
+
+| Model | Params (M) | Recall@1 | Config | Download |
+| :--------------------------- | :--------: | :------: | :----------------------------------: | :-------------------------------------------------------------------------------------------------------------: |
+| `blip2_3rdparty_retrieval`\* | 1173.19 | 85.40 | [config](./blip2_8xb32_retrieval.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip2/blip2_3rdparty_pretrain_20230505-f7ef4390.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/salesforce/LAVIS). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{blip2,
+ title={Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models},
+ author={Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven},
+ year={2023},
+ eprint={2301.12597},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/blip2/blip2-opt2.7b_8xb16_gqa.py b/configs/blip2/blip2-opt2.7b_8xb16_gqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..37fbd95e8e4b49d87f4da7b8d0f4cc7650f23dcd
--- /dev/null
+++ b/configs/blip2/blip2-opt2.7b_8xb16_gqa.py
@@ -0,0 +1,87 @@
+_base_ = [
+ '../_base_/datasets/gqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Blip2VQA',
+ tokenizer=dict(
+ type='AutoTokenizer', name_or_path='facebook/opt-2.7b',
+ use_fast=False),
+ vision_backbone=dict(
+ type='BEiTViT',
+ # eva-g without the final layer
+ arch=dict(
+ embed_dims=1408,
+ num_layers=39,
+ num_heads=16,
+ feedforward_channels=6144,
+ ),
+ img_size=364,
+ patch_size=14,
+ out_indices=-2,
+ layer_scale_init_value=0.0,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ frozen_stages=39,
+ final_norm=False,
+ use_shared_rel_pos_bias=False,
+ out_type='raw'),
+ text_backbone=dict(
+ type='OPTForCausalLM', name_or_path='facebook/opt-2.7b'),
+ multimodal_backbone=dict(
+ type='Qformer',
+ model_style='bert-base-uncased',
+ vision_model_width=1408,
+ add_cross_attention=True,
+ cross_attention_freq=2,
+ num_query_token=32),
+ vision_neck=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=2560,
+ ),
+ prompt='Question: {} Short Answer:',
+ max_txt_len=10)
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='PackInputs', algorithm_keys=['question', 'gt_answer']),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(224, 224),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(type='PackInputs', algorithm_keys=['question', 'gt_answer']),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=10,
+ )
+]
+
+train_cfg = dict(max_epochs=10)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/blip2/blip2-opt2.7b_8xb16_vqa.py b/configs/blip2/blip2-opt2.7b_8xb16_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..13a808dc224454642392142f9f6598f42e717b64
--- /dev/null
+++ b/configs/blip2/blip2-opt2.7b_8xb16_vqa.py
@@ -0,0 +1,95 @@
+_base_ = [
+ '../_base_/datasets/coco_vqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Blip2VQA',
+ tokenizer=dict(
+ type='AutoTokenizer', name_or_path='facebook/opt-2.7b',
+ use_fast=False),
+ vision_backbone=dict(
+ type='BEiTViT',
+ # eva-g without the final layer
+ arch=dict(
+ embed_dims=1408,
+ num_layers=39,
+ num_heads=16,
+ feedforward_channels=6144,
+ ),
+ img_size=364,
+ patch_size=14,
+ out_indices=-2,
+ layer_scale_init_value=0.0,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ frozen_stages=39,
+ final_norm=False,
+ use_shared_rel_pos_bias=False,
+ out_type='raw'),
+ text_backbone=dict(
+ type='OPTForCausalLM', name_or_path='facebook/opt-2.7b'),
+ multimodal_backbone=dict(
+ type='Qformer',
+ model_style='bert-base-uncased',
+ vision_model_width=1408,
+ add_cross_attention=True,
+ cross_attention_freq=2,
+ num_query_token=32),
+ vision_neck=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=2560,
+ ),
+ prompt='Question: {} Answer:',
+ max_txt_len=10)
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(224, 224),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=10,
+ )
+]
+
+train_cfg = dict(max_epochs=10)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/blip2/blip2-opt2.7b_8xb32_caption.py b/configs/blip2/blip2-opt2.7b_8xb32_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..52d0a63223ffdaf69730dffc2a6d4212765255a6
--- /dev/null
+++ b/configs/blip2/blip2-opt2.7b_8xb32_caption.py
@@ -0,0 +1,76 @@
+_base_ = [
+ '../_base_/datasets/coco_caption.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Blip2Caption',
+ tokenizer=dict(
+ type='AutoTokenizer', name_or_path='facebook/opt-2.7b',
+ use_fast=False),
+ vision_backbone=dict(
+ type='BEiTViT',
+ # eva-g without the final layer
+ arch=dict(
+ embed_dims=1408,
+ num_layers=39,
+ num_heads=16,
+ feedforward_channels=6144,
+ ),
+ img_size=364,
+ patch_size=14,
+ out_indices=-2,
+ layer_scale_init_value=0.0,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ frozen_stages=39,
+ final_norm=False,
+ use_shared_rel_pos_bias=False,
+ out_type='raw'),
+ text_backbone=dict(
+ type='OPTForCausalLM', name_or_path='facebook/opt-2.7b'),
+ multimodal_backbone=dict(
+ type='Qformer',
+ model_style='bert-base-uncased',
+ vision_model_width=1408,
+ add_cross_attention=True,
+ cross_attention_freq=2,
+ num_query_token=32),
+ vision_neck=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=2560,
+ ),
+ prompt='a photo of',
+ max_txt_len=30)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=10,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=10)
+val_cfg = dict()
+test_cfg = dict()
+
+# dataset settings
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(364, 364),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
diff --git a/configs/blip2/blip2_8xb32_retrieval.py b/configs/blip2/blip2_8xb32_retrieval.py
new file mode 100644
index 0000000000000000000000000000000000000000..75cb66cbfd53ac5e4e53928a65eb8617f00fb4af
--- /dev/null
+++ b/configs/blip2/blip2_8xb32_retrieval.py
@@ -0,0 +1,82 @@
+_base_ = [
+ '../_base_/datasets/coco_retrieval.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Blip2Retrieval',
+ tokenizer=dict(type='Blip2Tokenizer', name_or_path='bert-base-uncased'),
+ vision_backbone=dict(
+ type='BEiTViT',
+ # eva-g without the final layer
+ arch=dict(
+ embed_dims=1408,
+ num_layers=39,
+ num_heads=16,
+ feedforward_channels=6144,
+ ),
+ img_size=364,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ final_norm=False,
+ use_shared_rel_pos_bias=False,
+ out_type='raw'),
+ multimodal_backbone=dict(
+ type='Qformer',
+ model_style='bert-base-uncased',
+ vision_model_width=1408,
+ add_cross_attention=True,
+ cross_attention_freq=2,
+ num_query_token=32),
+ vision_neck=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=256,
+ ),
+ text_neck=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=256,
+ ),
+ multimodal_head=dict(
+ type='ITMHead',
+ hidden_size=768,
+ with_pooler=False,
+ ),
+ topk=128,
+ max_txt_len=35,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(364, 364),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'gt_text_id', 'gt_image_id'],
+ meta_keys=['image_id']),
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# optimizer
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.04)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+# learning rate scheduler
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=6)
+val_cfg = dict(type='RetrievalValLoop')
+test_cfg = dict(type='RetrievalTestLoop')
+
+randomness = dict(seed=42)
diff --git a/configs/blip2/metafile.yml b/configs/blip2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..b822103a21fa0b1b350ffbc6c5fdd6fb8ad4e8e2
--- /dev/null
+++ b/configs/blip2/metafile.yml
@@ -0,0 +1,71 @@
+Collections:
+ - Name: BLIP-2
+ Metadata:
+ Training Data:
+ - COCO
+ - VG
+ - CC3M
+ - CC12M
+ - SBU
+ - LAION-400M
+ Training Resources: 8x A100 GPUs
+ Architecture:
+ - Transformer
+ - Q-Former
+ Paper:
+ Title: 'BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
+ Encoders and Large Language Models'
+ URL: https://arxiv.org/abs/2301.12597
+ README: configs/blip2/README.md
+
+Models:
+ - Name: blip2_3rdparty_retrieval
+ Metadata:
+ FLOPs: null
+ Parameters: 1173191358
+ In Collection: BLIP-2
+ Results:
+ - Task: Image-To-Text Retrieval
+ Dataset: COCO
+ Metrics:
+ Recall@1: 85.4
+ - Task: Text-To-Image Retrieval
+ Dataset: COCO
+ Metrics:
+ Recall@1: 68.3
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip2/blip2_3rdparty_pretrain_20230505-f7ef4390.pth
+ Config: configs/blip2/blip2_8xb32_retrieval.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_opt2.7b.pth
+ Code: https://github.com/salesforce/LAVIS
+ - Name: blip2-opt2.7b_3rdparty-zeroshot_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 3770465152
+ In Collection: BLIP-2
+ Results:
+ - Task: Visual Question Answering
+ Dataset: VQAv2
+ Metrics:
+ Accuracy: 53.5
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth
+ Config: configs/blip2/blip2-opt2.7b_8xb16_vqa.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_opt2.7b.pth
+ Code: https://github.com/salesforce/LAVIS
+ - Name: blip2-opt2.7b_3rdparty-zeroshot_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 3770465152
+ In Collection: BLIP-2
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics:
+ BLEU-4: 32.90
+ CIDER: 111.10
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth
+ Config: configs/blip2/blip2-opt2.7b_8xb32_caption.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_opt2.7b.pth
+ Code: https://github.com/salesforce/LAVIS
diff --git a/configs/byol/README.md b/configs/byol/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2bfc8d064159ecfddaf2b2a4d0dca302b55e5f1f
--- /dev/null
+++ b/configs/byol/README.md
@@ -0,0 +1,85 @@
+# BYOL
+
+> [Bootstrap your own latent: A new approach to self-supervised Learning](https://arxiv.org/abs/2006.07733)
+
+
+
+## Abstract
+
+**B**ootstrap **Y**our **O**wn **L**atent (BYOL) is a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network.
+
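+
+The interaction between the two networks boils down to a regression loss from the online prediction to the target projection, plus a slow-moving (EMA) update of the target weights. The sketch below is an illustrative simplification with made-up module names (`online_net`, `predictor`, `target_net`), not the `BYOL` implementation shipped in this repo:
+
+```python
+import copy
+
+import torch
+import torch.nn.functional as F
+
+
+def byol_step(online_net, predictor, target_net, view1, view2, momentum=0.99):
+    """One (asymmetric) BYOL step on two augmented views of the same images.
+
+    The real method symmetrizes the loss over both view orderings.
+    """
+    # Online branch predicts the target branch's projection of the other view.
+    pred = predictor(online_net(view1))
+    with torch.no_grad():
+        target = target_net(view2)
+    # Negative cosine similarity, equivalent to MSE between L2-normalized vectors.
+    loss = 2 - 2 * F.cosine_similarity(pred, target, dim=-1).mean()
+    # Slow-moving average update of the target network.
+    with torch.no_grad():
+        for t, o in zip(target_net.parameters(), online_net.parameters()):
+            t.mul_(momentum).add_(o, alpha=1 - momentum)
+    return loss
+
+
+# Toy usage: the target network starts as a copy of the online network.
+online = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
+predictor = torch.nn.Linear(128, 128)
+target = copy.deepcopy(online)
+loss = byol_step(online, predictor, target,
+                 torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32))
+```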
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_byol-pre_8xb512-linear-coslr-90e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('byol_resnet50_16xb256-coslr-200e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-7596c6f5.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :-------------------------------------- | :--------: | :-------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: |
+| `byol_resnet50_16xb256-coslr-200e_in1k` | 68.02 | 4.11 | [config](byol_resnet50_16xb256-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_byol-pre_8xb512-linear-coslr-90e_in1k` | [BYOL](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth) | 25.56 | 4.11 | 71.80 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-7596c6f5.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-7596c6f5.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{grill2020bootstrap,
+ title={Bootstrap your own latent: A new approach to self-supervised learning},
+ author={Grill, Jean-Bastien and Strub, Florian and Altch{\'e}, Florent and Tallec, Corentin and Richemond, Pierre H and Buchatskaya, Elena and Doersch, Carl and Pires, Bernardo Avila and Guo, Zhaohan Daniel and Azar, Mohammad Gheshlaghi and others},
+ booktitle={NeurIPS},
+ year={2020}
+}
+```
diff --git a/configs/byol/benchmarks/mask-rcnn_r50-c4_ms-1x_coco.py b/configs/byol/benchmarks/mask-rcnn_r50-c4_ms-1x_coco.py
new file mode 100644
index 0000000000000000000000000000000000000000..4949db16a922737c5809b2c07519a6bb6867d165
--- /dev/null
+++ b/configs/byol/benchmarks/mask-rcnn_r50-c4_ms-1x_coco.py
@@ -0,0 +1,46 @@
+_base_ = 'mmdet::mask_rcnn/mask-rcnn_r50-caffe-c4_1x_coco.py'
+# https://github.com/open-mmlab/mmdetection/blob/dev-3.x/configs/mask_rcnn/mask-rcnn_r50-caffe-c4_1x_coco.py
+
+data_preprocessor = dict(
+ type='DetDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True,
+ pad_mask=True,
+ pad_size_divisor=32)
+
+norm_cfg = dict(type='SyncBN', requires_grad=True)
+model = dict(
+ data_preprocessor=data_preprocessor,
+ backbone=dict(
+ frozen_stages=-1,
+ norm_cfg=norm_cfg,
+ norm_eval=False,
+ style='pytorch',
+ init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
+ roi_head=dict(
+ shared_head=dict(
+ type='ResLayerExtraNorm',
+ norm_cfg=norm_cfg,
+ norm_eval=False,
+ style='pytorch')))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(1333, 640), (1333, 672), (1333, 704), (1333, 736),
+ (1333, 768), (1333, 800)],
+ keep_ratio=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(type='PackDetInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=12, val_interval=1)
+
+custom_imports = dict(
+ imports=['mmpretrain.models.utils.res_layer_extra_norm'],
+ allow_failed_imports=False)
diff --git a/configs/byol/benchmarks/mask-rcnn_r50_fpn_ms-1x_coco.py b/configs/byol/benchmarks/mask-rcnn_r50_fpn_ms-1x_coco.py
new file mode 100644
index 0000000000000000000000000000000000000000..1341f1508bdc400da6e79b47e1a174c0819fc79b
--- /dev/null
+++ b/configs/byol/benchmarks/mask-rcnn_r50_fpn_ms-1x_coco.py
@@ -0,0 +1,24 @@
+_base_ = 'mmdet::mask_rcnn/mask-rcnn_r50_fpn_1x_coco.py'
+# https://github.com/open-mmlab/mmdetection/blob/dev-3.x/configs/mask_rcnn/mask-rcnn_r50_fpn_1x_coco.py
+
+norm_cfg = dict(type='SyncBN', requires_grad=True)
+model = dict(
+ backbone=dict(frozen_stages=-1, norm_cfg=norm_cfg, norm_eval=False),
+ neck=dict(norm_cfg=norm_cfg),
+ roi_head=dict(
+ bbox_head=dict(type='Shared4Conv1FCBBoxHead', norm_cfg=norm_cfg),
+ mask_head=dict(norm_cfg=norm_cfg)))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(1333, 640), (1333, 672), (1333, 704), (1333, 736),
+ (1333, 768), (1333, 800)],
+ keep_ratio=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(type='PackDetInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
diff --git a/configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py b/configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2b5074c082b8b6fb36bd3c6711b60bab6394b4ce
--- /dev/null
+++ b/configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
@@ -0,0 +1,18 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_lars_coslr_90e.py',
+ '../../_base_/default_runtime.py',
+]
+
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# dataset summary
+train_dataloader = dict(batch_size=512)
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py b/configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8dd3fd8bee88206f18d79500c401fa1f787d6e7f
--- /dev/null
+++ b/configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py
@@ -0,0 +1,60 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_byol.py',
+ '../_base_/schedules/imagenet_lars_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='BYOL',
+ base_momentum=0.01,
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=False),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=True,
+ with_last_bn=False,
+ with_avg_pool=True),
+ head=dict(
+ type='LatentPredictHead',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=True,
+ with_last_bn=False,
+ with_avg_pool=False),
+ loss=dict(type='CosineSimilarityLoss')),
+)
+
+# optimizer
+optimizer = dict(type='LARS', lr=4.8, momentum=0.9, weight_decay=1e-6)
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=optimizer,
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }),
+)
+
+# runtime settings
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/byol/metafile.yml b/configs/byol/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..09aacad1c580ec4ec4abe08e60dffd30eba540a8
--- /dev/null
+++ b/configs/byol/metafile.yml
@@ -0,0 +1,44 @@
+Collections:
+ - Name: BYOL
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - LARS
+ Training Resources: 8x V100 GPUs (b256), 16x A100-80G GPUs (b4096)
+ Architecture:
+ - ResNet
+ - BYOL
+ Paper:
+ Title: 'Bootstrap your own latent: A new approach to self-supervised Learning'
+ URL: https://arxiv.org/abs/2006.07733
+ README: configs/byol/README.md
+
+Models:
+ - Name: byol_resnet50_16xb256-coslr-200e_in1k
+ Metadata:
+ Epochs: 200
+ Batch Size: 4096
+ FLOPs: 4109364224
+ Parameters: 68024448
+ Training Data: ImageNet-1k
+ In Collection: BYOL
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth
+ Config: configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py
+ Downstream:
+ - resnet50_byol-pre_8xb512-linear-coslr-90e_in1k
+ - Name: resnet50_byol-pre_8xb512-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 4096
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: BYOL
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 71.8
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-7596c6f5.pth
+ Config: configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
diff --git a/configs/cae/README.md b/configs/cae/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..dc1c818d71c35300930c4a11b5e2ed52b995cd0e
--- /dev/null
+++ b/configs/cae/README.md
@@ -0,0 +1,86 @@
+# CAE
+
+> [Context Autoencoder for Self-Supervised Representation Learning](https://arxiv.org/abs/2202.03026)
+
+
+
+## Abstract
+
+We present a novel masked image modeling (MIM) approach, context autoencoder (CAE), for self-supervised learning. We randomly partition the image into two sets: visible patches and masked patches. The CAE architecture consists of: (i) an encoder that takes visible patches as input and outputs their latent representations, (ii) a latent context regressor that predicts the masked patch representations from the visible patch representations that are not updated in this regressor, (iii) a decoder that takes the estimated masked patch representations as input and makes predictions for the masked patches, and (iv) an alignment module that aligns the masked patch representation estimation with the masked patch representations computed from the encoder. In comparison to previous MIM methods that couple the encoding and decoding roles, e.g., using a single module in BEiT, our approach attempts to separate the encoding role (content understanding) from the decoding role (making predictions for masked patches) using different modules, improving the content understanding capability. In addition, our approach makes predictions from the visible patches to the masked patches in the latent representation space that is expected to take on semantics. In addition, we present the explanations about why contrastive pretraining and supervised pretraining perform similarly and why MIM potentially performs better. We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks: semantic segmentation, and object detection and instance segmentation.
+
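+
+Concretely, the four modules above combine into a masked-patch prediction loss plus a latent alignment loss. The snippet below is only a schematic with hypothetical callables (`encoder`, `regressor`, `decoder`, `target_tokenizer`), not the `CAE` model implemented in this repo; `lambd=2.0` mirrors the loss weight used in the pre-training config.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def cae_objective(encoder, regressor, decoder, target_tokenizer,
+                  visible_patches, masked_patches, lambd=2.0):
+    """Schematic CAE objective: masked-patch prediction + latent alignment."""
+    # (i) Encode only the visible patches.
+    z_visible = encoder(visible_patches)
+    # (ii) Regress latent representations of the masked patches from the visible ones.
+    z_masked_pred = regressor(z_visible)
+    # (iii) Decode the estimated latents into predictions for the masked patches.
+    logits = decoder(z_masked_pred)
+    with torch.no_grad():
+        # (iv) Alignment target: masked-patch latents computed by the same encoder.
+        z_masked = encoder(masked_patches)
+        # Prediction target, e.g. discrete visual tokens (a DALL-E tokenizer in the config).
+        tokens = target_tokenizer(masked_patches)
+    pred_loss = F.cross_entropy(logits, tokens)
+    align_loss = F.mse_loss(z_masked_pred, z_masked)
+    return pred_loss + lambd * align_loss
+```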
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('beit-base-p16_cae-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('cae_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_16xb128-fp16-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k_20220825-f3d234cd.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------------------- | :--------: | :-------: | :-------------------------------------------------------: | :----------------------------------------------------------------------------: |
+| `cae_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 288.43 | 17.58 | [config](cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221230-808170f3.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221230-808170f3.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `beit-base-p16_cae-pre_8xb128-coslr-100e_in1k` | [CAE](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221230-808170f3.pth) | 86.68 | 17.58 | 83.20 | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_16xb128-fp16-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k_20220825-f3d234cd.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_16xb128-fp16-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k_20220825-f3d234cd.json) |
+
+## Citation
+
+```bibtex
+@article{CAE,
+ title={Context Autoencoder for Self-Supervised Representation Learning},
+ author={Chen, Xiaokang and Ding, Mingyu and Wang, Xiaodi and Xin, Ying and Mo, Shentong and
+ Wang, Yunhao and Han, Shumin and Luo, Ping and Zeng, Gang and Wang, Jingdong},
+ journal={arXiv preprint arXiv:2202.03026},
+ year={2022}
+}
+```
diff --git a/configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py b/configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e7083ce80a8311220fe6ebd5b6024c195887aa57
--- /dev/null
+++ b/configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,130 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+# CAE fine-tuning setting
+
+# dataset
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128)
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline), batch_size=128)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ final_norm=False, # do not use final norm
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.1,
+ out_type='avg_featmap',
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=True,
+ use_shared_rel_pos_bias=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=2e-5)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=8e-3, betas=(0.9, 0.999), weight_decay=0.05),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.65,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+randomness = dict(seed=0)
diff --git a/configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..725b0f07ce71fa0ea98ae7343f0dbf47adda3ebb
--- /dev/null
+++ b/configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,115 @@
+_base_ = '../_base_/default_runtime.py'
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='TwoNormDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ second_mean=[-31.875, -31.875, -31.875],
+ second_std=[318.75, 318.75, 318.75],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomResizedCropAndInterpolationWithTwoPic',
+ size=224,
+ second_size=112,
+ interpolation='bicubic',
+ second_interpolation='lanczos',
+ scale=(0.08, 1.0)),
+ dict(
+ type='BEiTMaskGenerator',
+ input_size=(14, 14),
+ num_masking_patches=75,
+ max_num_patches=None,
+ min_num_patches=16),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
+
+# model settings
+model = dict(
+ type='CAE',
+ backbone=dict(
+ type='CAEPretrainViT',
+ arch='b',
+ patch_size=16,
+ layer_scale_init_value=0.1,
+ bias='qv_bias'),
+ neck=dict(
+ type='CAENeck',
+ embed_dims=768,
+ num_heads=12,
+ regressor_depth=4,
+ decoder_depth=4,
+ mlp_ratio=4,
+ layer_scale_init_value=0.1,
+ ),
+ head=dict(type='CAEHead', loss=dict(type='CAELoss', lambd=2)),
+ target_generator=dict(
+ type='DALL-E',
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint= # noqa: E251
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/dalle_encoder.pth', # noqa: E501
+ )),
+ base_momentum=0.0)
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05),
+ clip_grad=dict(max_norm=3.0),
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0, norm_decay_mult=0.0, flat_decay_mult=0.0))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=290,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/cae/metafile.yml b/configs/cae/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..83f46f9f810384979a0f0b4483e9ab518653bcff
--- /dev/null
+++ b/configs/cae/metafile.yml
@@ -0,0 +1,43 @@
+Collections:
+ - Name: CAE
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ Training Resources: 8x A100-80G GPUs
+ Architecture:
+ - ViT
+ Paper:
+ Title: Context Autoencoder for Self-Supervised Representation Learning
+ URL: https://arxiv.org/abs/2202.03026
+ README: configs/cae/README.md
+
+Models:
+ - Name: cae_beit-base-p16_8xb256-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 2048
+ FLOPs: 17581976064
+ Parameters: 288429952
+ Training Data: ImageNet-1k
+ In Collection: CAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221230-808170f3.pth
+ Config: configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+ Downstream:
+ - beit-base-p16_cae-pre_8xb128-coslr-100e_in1k
+ - Name: beit-base-p16_cae-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581219584
+ Parameters: 86682280
+ Training Data: ImageNet-1k
+ In Collection: CAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.2
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_16xb128-fp16-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k_20220825-f3d234cd.pth
+ Config: configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
diff --git a/configs/chinese_clip/README.md b/configs/chinese_clip/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..acb37e7a2adfdf641e07a695ec064cf8507f33ed
--- /dev/null
+++ b/configs/chinese_clip/README.md
@@ -0,0 +1,69 @@
+# ChineseCLIP
+
+> [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335)
+
+
+
+## Abstract
+
+The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, where the model is first trained with the image encoder frozen and then trained with all parameters being optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups of zero-shot learning and finetuning, and it is able to achieve competitive performance in zero-shot image classification based on the evaluation on the ELEVATER benchmark (Li et al., 2022). We have released our codes, models, and demos in https://github.com/OFA-Sys/Chinese-CLIP
+
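+
+The two-stage recipe described above reduces to toggling gradients on the image encoder between stages. A minimal PyTorch sketch, assuming a model object with `vision_backbone` and `text_backbone` submodules (hypothetical names, not the `ChineseCLIP` class configured below):
+
+```python
+import torch
+
+
+def set_pretraining_stage(model: torch.nn.Module, stage: int) -> None:
+    """Stage 1: keep the pretrained image encoder frozen and train the rest.
+    Stage 2: unfreeze everything and continue contrastive training."""
+    freeze_vision = stage == 1
+    for p in model.vision_backbone.parameters():
+        p.requires_grad = not freeze_vision
+    for p in model.text_backbone.parameters():
+        p.requires_grad = True
+
+
+# The optimizer should then only see trainable parameters, e.g.:
+# optimizer = torch.optim.AdamW(
+#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
+```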
+
+
+
+
+## How to use it?
+
+
+
+**Use the model for zero-shot classification**
+
+```python
+from mmpretrain import ImageClassificationInferencer
+
+inferencer = ImageClassificationInferencer(
+ 'cn-clip_resnet50_zeroshot-cls_cifar100',
+ pretrained=True,
+ classes=['鸟', '狗', '猫', '蛇'],
+ text_prototype=['鸟', '狗', '猫', '蛇'],
+)
+
+prediction = inferencer('./demo/bird.JPEG')[0]
+print('Results:', prediction['pred_class'])
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_resnet50_3rdparty_20230519-6a2b3eb2.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on CIFAR100
+
+| Model | Params (M) | Top-1 (%) | Config | Download |
+| :---------------------------------------------- | :--------: | :-------: | :------------------------------------------------------: | :----------------------------------------------------------------------------: |
+| `cn-clip_resnet50_zeroshot-cls_cifar100`\* | 77.00 | 40.70 | [config](cn-clip_resnet50_zeroshot-cls_cifar100.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_resnet50_3rdparty_20230519-6a2b3eb2.pth) |
+| `cn-clip_vit-base-p16_zeroshot-cls_cifar100`\* | 188.00 | 64.50 | [config](cn-clip_vit-base-p16_zeroshot-cls_cifar100.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-base-p16_3rdparty_20230519-37fbc59e.pth) |
+| `cn-clip_vit-large-p14_zeroshot-cls_cifar100`\* | 406.00 | 74.80 | [config](cn-clip_vit-large-p14_zeroshot-cls_cifar100.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-large-p14_3rdparty_20230519-3f844503.pth) |
+| `cn-clip_vit-huge-p14_zeroshot-cls_cifar100`\* | 958.00 | 79.10 | [config](cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-huge-p14_3rdparty_20230519-e4f49b00.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/OFA-Sys/Chinese-CLIP). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{chinese-clip,
+ title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
+ author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
+ journal={arXiv preprint arXiv:2211.01335},
+ year={2022}
+}
+```
diff --git a/configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py b/configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..e109a5bfbb4442580aa830259a2a29f4ba11a0b5
--- /dev/null
+++ b/configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py
@@ -0,0 +1,72 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='CIFAR100',
+ data_root='data/cifar100',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, ))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='ChineseCLIP',
+ vision_backbone=dict(
+ type='ModifiedResNet',
+ depth=50,
+ base_channels=64,
+ input_size=224,
+ num_attn_heads=32,
+ output_dim=1024,
+ ),
+ text_backbone=dict(
+ type='BertModelCN',
+ config=dict(
+ vocab_size=21128,
+ pad_token_id=0,
+ add_type_embeddings=True,
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ max_position_embeddings=512,
+ num_attention_heads=12,
+ num_hidden_layers=3,
+ type_vocab_size=2,
+ layer_norm_eps=1e-12)),
+ tokenizer=dict(
+ type='FullTokenizer',
+ vocab_file= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/vocab.txt'
+ ),
+ proj_dim=1024,
+ text_prototype='cifar100',
+)
diff --git a/configs/chinese_clip/cn-clip_vit-base-p16_zeroshot-cls_cifar100.py b/configs/chinese_clip/cn-clip_vit-base-p16_zeroshot-cls_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..1c0ad1c9e39bcbfc615e688d5fc8c2812789989b
--- /dev/null
+++ b/configs/chinese_clip/cn-clip_vit-base-p16_zeroshot-cls_cifar100.py
@@ -0,0 +1,76 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='CIFAR100',
+ data_root='data/cifar100',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, ))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='ChineseCLIP',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ final_norm=True,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ out_type='cls_token',
+ ),
+ text_backbone=dict(
+ type='BertModelCN',
+ config=dict(
+ vocab_size=21128,
+ pad_token_id=0,
+ add_type_embeddings=True,
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ max_position_embeddings=512,
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ type_vocab_size=2,
+ layer_norm_eps=1e-12)),
+ tokenizer=dict(
+ type='FullTokenizer',
+ vocab_file= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/vocab.txt'
+ ),
+ proj_dim=512,
+ text_prototype='cifar100',
+)
diff --git a/configs/chinese_clip/cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py b/configs/chinese_clip/cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..83aae122e8f0d2ec4fd78bb69e94feda09672980
--- /dev/null
+++ b/configs/chinese_clip/cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py
@@ -0,0 +1,75 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='CIFAR100',
+ data_root='data/cifar100',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, ))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='ChineseCLIP',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='huge',
+ img_size=224,
+ patch_size=14,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ final_norm=True,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ out_type='cls_token',
+ ),
+ text_backbone=dict(
+ type='BertModelCN',
+ config=dict(
+ vocab_size=21128,
+ pad_token_id=0,
+ add_type_embeddings=True,
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=1024,
+ initializer_range=0.02,
+ intermediate_size=4096,
+ max_position_embeddings=512,
+ num_attention_heads=16,
+ num_hidden_layers=24,
+ type_vocab_size=2,
+ layer_norm_eps=1e-12)),
+ tokenizer=dict(
+ type='FullTokenizer',
+ vocab_file= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/vocab.txt'
+ ),
+ proj_dim=1024,
+ text_prototype='cifar100',
+)
diff --git a/configs/chinese_clip/cn-clip_vit-large-p14_zeroshot-cls_cifar100.py b/configs/chinese_clip/cn-clip_vit-large-p14_zeroshot-cls_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..35f0b6fb53fa2bf8d389f4a0f6ea08bdbac72175
--- /dev/null
+++ b/configs/chinese_clip/cn-clip_vit-large-p14_zeroshot-cls_cifar100.py
@@ -0,0 +1,75 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='CIFAR100',
+ data_root='data/cifar100',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, ))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='ChineseCLIP',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=224,
+ patch_size=14,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ final_norm=True,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ out_type='cls_token',
+ ),
+ text_backbone=dict(
+ type='BertModelCN',
+ config=dict(
+ vocab_size=21128,
+ pad_token_id=0,
+ add_type_embeddings=True,
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ max_position_embeddings=512,
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ type_vocab_size=2,
+ layer_norm_eps=1e-12)),
+ tokenizer=dict(
+ type='FullTokenizer',
+ vocab_file= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/vocab.txt'
+ ),
+ proj_dim=768,
+ text_prototype='cifar100',
+)
diff --git a/configs/chinese_clip/metafile.yml b/configs/chinese_clip/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..40ebb49e001691c4b8adc87a9e1f24d352e41441
--- /dev/null
+++ b/configs/chinese_clip/metafile.yml
@@ -0,0 +1,79 @@
+Collections:
+ - Name: ChineseCLIP
+ Metadata:
+ Training Data:
+ - LAION-5B
+ - WuKong
+ - VisualGenome
+ - MSCOCO
+ Architecture:
+ - Transformer
+ Paper:
+ Title: 'Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese'
+ URL: https://arxiv.org/abs/2211.01335
+ README: configs/chinese_clip/README.md
+
+Models:
+ - Name: cn-clip_resnet50_zeroshot-cls_cifar100
+ Metadata:
+ FLOPs: null
+ Parameters: 77000000
+ In Collection: ChineseCLIP
+ Results:
+ - Task: Image Classification
+ Dataset: CIFAR100
+ Metrics:
+ Top 1 Accuracy: 40.7
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_resnet50_3rdparty_20230519-6a2b3eb2.pth
+ Config: configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py
+ Converted From:
+ Weights: https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_rn50.pt
+ Code: https://github.com/OFA-Sys/Chinese-CLIP
+
+ - Name: cn-clip_vit-base-p16_zeroshot-cls_cifar100
+ Metadata:
+ FLOPs: null
+ Parameters: 188000000
+ In Collection: ChineseCLIP
+ Results:
+ - Task: Image Classification
+ Dataset: CIFAR100
+ Metrics:
+ Top 1 Accuracy: 64.5
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-base-p16_3rdparty_20230519-37fbc59e.pth
+ Config: configs/chinese_clip/cn-clip_vit-base-p16_zeroshot-cls_cifar100.py
+ Converted From:
+ Weights: https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-b-16.pt
+ Code: https://github.com/OFA-Sys/Chinese-CLIP
+
+ - Name: cn-clip_vit-large-p14_zeroshot-cls_cifar100
+ Metadata:
+ FLOPs: null
+ Parameters: 406000000
+ In Collection: ChineseCLIP
+ Results:
+ - Task: Image Classification
+ Dataset: CIFAR100
+ Metrics:
+ Top 1 Accuracy: 74.8
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-large-p14_3rdparty_20230519-3f844503.pth
+ Config: configs/chinese_clip/cn-clip_vit-large-p14_zeroshot-cls_cifar100.py
+ Converted From:
+ Weights: https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-l-14.pt
+ Code: https://github.com/OFA-Sys/Chinese-CLIP
+
+ - Name: cn-clip_vit-huge-p14_zeroshot-cls_cifar100
+ Metadata:
+ FLOPs: null
+ Parameters: 958000000
+ In Collection: ChineseCLIP
+ Results:
+ - Task: Image Classification
+ Dataset: CIFAR100
+ Metrics:
+ Top 1 Accuracy: 79.1
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-huge-p14_3rdparty_20230519-e4f49b00.pth
+ Config: configs/chinese_clip/cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py
+ Converted From:
+ Weights: https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-h-14.pt
+ Code: https://github.com/OFA-Sys/Chinese-CLIP
diff --git a/configs/clip/README.md b/configs/clip/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7a14be4d8e05fe3ba1c9d51106889b63029964b9
--- /dev/null
+++ b/configs/clip/README.md
@@ -0,0 +1,90 @@
+# CLIP
+
+> [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
+
+
+
+## Abstract
+
+State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.
+
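+
+The pre-training task of "predicting which caption goes with which image" is a symmetric contrastive (InfoNCE) loss over a batch of matched pairs. The sketch below is a generic illustration with made-up tensor names, not the training code behind the checkpoints on this page:
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
+    """Symmetric InfoNCE over N matched (image, text) pairs."""
+    image_embeds = F.normalize(image_embeds, dim=-1)
+    text_embeds = F.normalize(text_embeds, dim=-1)
+    logits = image_embeds @ text_embeds.t() / temperature  # (N, N) cosine similarities
+    targets = torch.arange(logits.size(0))  # the i-th image matches the i-th text
+    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
+    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
+    return (loss_i2t + loss_t2i) / 2
+
+
+# Toy usage with 8 matched pairs of 512-d embeddings.
+loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
+```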
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/clip/vit-base-p32_pt-64xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k_20221220-b384e830.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------- | :-----------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------------: | :----------------------------------------------: |
+| `vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k`\* | CLIP LAION2B ImageNet-12k | 88.22 | 4.36 | 83.06 | 96.49 | [config](vit-base-p32_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k_20221220-b384e830.pth) |
+| `vit-base-p32_clip-laion2b-pre_3rdparty_in1k`\* | CLIP LAION2B | 88.22 | 4.36 | 82.46 | 96.12 | [config](vit-base-p32_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-pre_3rdparty_in1k_20221220-194df57f.pth) |
+| `vit-base-p32_clip-openai-pre_3rdparty_in1k`\* | CLIP OPENAI | 88.22 | 4.36 | 81.77 | 95.89 | [config](vit-base-p32_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_openai-pre_3rdparty_in1k_20221220-a0182ba9.pth) |
+| `vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k-384px`\* | CLIP LAION2B ImageNet-12k | 88.22 | 12.66 | 85.39 | 97.67 | [config](vit-base-p32_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k-384px_20221220-c7757552.pth) |
+| `vit-base-p32_clip-openai-in12k-pre_3rdparty_in1k-384px`\* | CLIP OPENAI ImageNet-12k | 88.22 | 12.66 | 85.13 | 97.42 | [config](vit-base-p32_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_openai-in12k-pre_3rdparty_in1k-384px_20221220-dc2e49ea.pth) |
+| `vit-base-p16_clip-laion2b-in12k-pre_3rdparty_in1k`\* | CLIP LAION2B ImageNet-12k | 86.57 | 16.86 | 86.02 | 97.76 | [config](vit-base-p16_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-in12k-pre_3rdparty_in1k_20221220-a5e31f8c.pth) |
+| `vit-base-p16_clip-laion2b-pre_3rdparty_in1k`\* | CLIP LAION2B | 86.57 | 16.86 | 85.49 | 97.59 | [config](vit-base-p16_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-pre_3rdparty_in1k_20221220-5e24ff58.pth) |
+| `vit-base-p16_clip-openai-in12k-pre_3rdparty_in1k`\* | CLIP OPENAI ImageNet-12k | 86.57 | 16.86 | 85.99 | 97.72 | [config](vit-base-p16_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-in12k-pre_3rdparty_in1k_20221220-90d930a8.pth) |
+| `vit-base-p16_clip-openai-pre_3rdparty_in1k`\* | CLIP OPENAI | 86.57 | 16.86 | 85.30 | 97.50 | [config](vit-base-p16_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-pre_3rdparty_in1k_20221220-c7d9c899.pth) |
+| `vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k-448px`\* | CLIP LAION2B ImageNet-12k | 88.22 | 17.20 | 85.76 | 97.63 | [config](vit-base-p32_pt-64xb64_in1k-448px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k-448px_20221220-ca404a7d.pth) |
+| `vit-base-p16_clip-laion2b-in12k-pre_3rdparty_in1k-384px`\* | CLIP LAION2B ImageNet-12k | 86.57 | 49.37 | 87.17 | 98.02 | [config](vit-base-p16_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-in12k-pre_3rdparty_in1k-384px_20221220-84ed0cc0.pth) |
+| `vit-base-p16_clip-laion2b-pre_3rdparty_in1k-384px`\* | CLIP LAION2B | 86.57 | 49.37 | 86.52 | 97.97 | [config](vit-base-p16_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-pre_3rdparty_in1k-384px_20221220-558ed826.pth) |
+| `vit-base-p16_clip-openai-in12k-pre_3rdparty_in1k-384px`\* | CLIP OPENAI ImageNet-12k | 86.57 | 49.37 | 86.87 | 98.05 | [config](vit-base-p16_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-in12k-pre_3rdparty_in1k-384px_20221220-8df86b74.pth) |
+| `vit-base-p16_clip-openai-pre_3rdparty_in1k-384px`\* | CLIP OPENAI | 86.57 | 49.37 | 86.25 | 97.90 | [config](vit-base-p16_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-pre_3rdparty_in1k-384px_20221220-eb012e87.pth) |
+
+*Models with * are converted from [timm](https://github.com/rwightman/pytorch-image-models). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@InProceedings{pmlr-v139-radford21a,
+  title = {Learning Transferable Visual Models From Natural Language Supervision},
+  author = {Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
+  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
+  year = {2021},
+  series = {Proceedings of Machine Learning Research},
+  publisher = {PMLR},
+}
+```
diff --git a/configs/clip/clip_vit-base-p16_zeroshot-cls_cifar100.py b/configs/clip/clip_vit-base-p16_zeroshot-cls_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..dd684a50a319e9e2b4942ce59ae6e20744b2743e
--- /dev/null
+++ b/configs/clip/clip_vit-base-p16_zeroshot-cls_cifar100.py
@@ -0,0 +1,68 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='CIFAR100',
+ data_root='data/cifar100',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='CLIPZeroShot',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ ),
+ projection=dict(type='CLIPProjection', in_channels=768, out_channels=512),
+ text_backbone=dict(
+ type='CLIPTransformer',
+ width=512,
+ layers=12,
+ heads=8,
+ attn_mask=True,
+ ),
+ tokenizer=dict(
+ type='AutoTokenizer',
+ name_or_path='openai/clip-vit-base-patch16',
+ use_fast=False),
+ vocab_size=49408,
+ transformer_width=512,
+ proj_dim=512,
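+    # Build the zero-shot text classifier from CIFAR-100 class names, using the
+    # OpenAI prompt templates for CIFAR-100.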
+ text_prototype='cifar100',
+ text_prompt='openai_cifar100',
+ context_length=77,
+)
diff --git a/configs/clip/clip_vit-base-p16_zeroshot-cls_in1k.py b/configs/clip/clip_vit-base-p16_zeroshot-cls_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..80c4fde82f514c96d9f171d6b3ed57fdbccd923a
--- /dev/null
+++ b/configs/clip/clip_vit-base-p16_zeroshot-cls_in1k.py
@@ -0,0 +1,69 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='ImageNet',
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='CLIPZeroShot',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ ),
+ projection=dict(type='CLIPProjection', in_channels=768, out_channels=512),
+ text_backbone=dict(
+ type='CLIPTransformer',
+ width=512,
+ layers=12,
+ heads=8,
+ attn_mask=True,
+ ),
+ tokenizer=dict(
+ type='AutoTokenizer',
+ name_or_path='openai/clip-vit-base-patch16',
+ use_fast=False),
+ vocab_size=49408,
+ transformer_width=512,
+ proj_dim=512,
+ text_prototype='imagenet',
+ text_prompt='openai_imagenet_sub', # openai_imagenet, openai_imagenet_sub
+ context_length=77,
+)
diff --git a/configs/clip/clip_vit-large-p14_zeroshot-cls_cifar100.py b/configs/clip/clip_vit-large-p14_zeroshot-cls_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..a6dd7c1141211914c9e9835b73d0ee84a46ea3b6
--- /dev/null
+++ b/configs/clip/clip_vit-large-p14_zeroshot-cls_cifar100.py
@@ -0,0 +1,68 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='CIFAR100',
+ data_root='data/cifar100',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='CLIPZeroShot',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=224,
+ patch_size=14,
+ drop_rate=0.,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ ),
+ projection=dict(type='CLIPProjection', in_channels=1024, out_channels=768),
+ text_backbone=dict(
+ type='CLIPTransformer',
+ width=768,
+ layers=12,
+ heads=12,
+ attn_mask=True,
+ ),
+ tokenizer=dict(
+ type='AutoTokenizer',
+ name_or_path='openai/clip-vit-large-patch14',
+ use_fast=False),
+ vocab_size=49408,
+ transformer_width=768,
+ proj_dim=768,
+ text_prototype='cifar100',
+ text_prompt='openai_cifar100',
+ context_length=77,
+)
diff --git a/configs/clip/clip_vit-large-p14_zeroshot-cls_in1k.py b/configs/clip/clip_vit-large-p14_zeroshot-cls_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..10500017a9300e7c2cf8082e575378f346888c3d
--- /dev/null
+++ b/configs/clip/clip_vit-large-p14_zeroshot-cls_in1k.py
@@ -0,0 +1,69 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='ImageNet',
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='CLIPZeroShot',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=224,
+ patch_size=14,
+ drop_rate=0.,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ ),
+ projection=dict(type='CLIPProjection', in_channels=1024, out_channels=768),
+ text_backbone=dict(
+ type='CLIPTransformer',
+ width=768,
+ layers=12,
+ heads=12,
+ attn_mask=True,
+ ),
+ tokenizer=dict(
+ type='AutoTokenizer',
+ name_or_path='openai/clip-vit-large-patch14',
+ use_fast=False),
+ vocab_size=49408,
+ transformer_width=768,
+ proj_dim=768,
+ text_prototype='imagenet',
+ text_prompt='openai_imagenet_sub', # openai_imagenet, openai_imagenet_sub
+ context_length=77,
+)
diff --git a/configs/clip/metafile.yml b/configs/clip/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..a82eea49aa0815cf94ac9324ffaea445f815a473
--- /dev/null
+++ b/configs/clip/metafile.yml
@@ -0,0 +1,308 @@
+Collections:
+ - Name: CLIP
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ Title: Learning Transferable Visual Models From Natural Language Supervision
+ URL: https://arxiv.org/abs/2103.00020
+ README: configs/clip/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/vision_transformer.py
+ Version: v1.0.0
+
+Models:
+ - Name: vit-base-p32_clip-openai-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 4364335104
+ Parameters: 88225000
+ Training Data:
+ - OpenAI
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.77
+ Top 5 Accuracy: 95.89
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_openai-pre_3rdparty_in1k_20221220-a0182ba9.pth
+ Config: configs/clip/vit-base-p32_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch32_clip_224.openai_ft_in1k
+ - Name: vit-base-p32_clip-laion2b-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 4364335104
+ Parameters: 88225000
+ Training Data:
+ - LAION-2B
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.46
+ Top 5 Accuracy: 96.12
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-pre_3rdparty_in1k_20221220-194df57f.pth
+ Config: configs/clip/vit-base-p32_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch32_clip_224.laion2b_ft_in1k
+ - Name: vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 4364335104
+ Parameters: 88225000
+ Training Data:
+ - LAION-2B
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.06
+ Top 5 Accuracy: 96.49
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k_20221220-b384e830.pth
+ Config: configs/clip/vit-base-p32_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch32_clip_224.laion2b_ft_in12k_in1k
+ - Name: vit-base-p32_clip-openai-in12k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 12661054464
+ Parameters: 88225000
+ Training Data:
+ - OpenAI
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.13
+ Top 5 Accuracy: 97.42
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_openai-in12k-pre_3rdparty_in1k-384px_20221220-dc2e49ea.pth
+ Config: configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch32_clip_384.openai_ft_in12k_in1k
+ - Name: vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 12661054464
+ Parameters: 88225000
+ Training Data:
+ - LAION-2B
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.39
+ Top 5 Accuracy: 97.67
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k-384px_20221220-c7757552.pth
+ Config: configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch32_clip_384.laion2b_ft_in12k_in1k
+ - Name: vit-base-p16_clip-openai-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 16855600128
+ Parameters: 86568424
+ Training Data:
+ - OpenAI
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.3
+ Top 5 Accuracy: 97.5
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-pre_3rdparty_in1k_20221220-c7d9c899.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_224.openai_ft_in1k
+ - Name: vit-base-p16_clip-laion2b-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 16855600128
+ Parameters: 86568424
+ Training Data:
+ - LAION-2B
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.49
+ Top 5 Accuracy: 97.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-pre_3rdparty_in1k_20221220-5e24ff58.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_224.laion2b_ft_in1k
+ - Name: vit-base-p16_clip-openai-in12k-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 16855600128
+ Parameters: 86568424
+ Training Data:
+ - OpenAI
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.99
+ Top 5 Accuracy: 97.72
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-in12k-pre_3rdparty_in1k_20221220-90d930a8.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_224.openai_ft_in12k_in1k
+ - Name: vit-base-p16_clip-laion2b-in12k-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 16855600128
+ Parameters: 86568424
+ Training Data:
+ - LAION-2B
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.02
+ Top 5 Accuracy: 97.76
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-in12k-pre_3rdparty_in1k_20221220-a5e31f8c.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_224.laion2b_ft_in12k_in1k
+ - Name: vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k-448px
+ Metadata:
+ FLOPs: 17202416640
+ Parameters: 88225000
+ Training Data:
+ - LAION-2B
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.76
+ Top 5 Accuracy: 97.63
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k-448px_20221220-ca404a7d.pth
+ Config: configs/clip/vit-base-p32_pt-64xb64_in1k-448px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch32_clip_448.laion2b_ft_in12k_in1k
+ - Name: vit-base-p16_clip-openai-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 49370078208
+ Parameters: 86568424
+ Training Data:
+ - OpenAI
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.25
+ Top 5 Accuracy: 97.9
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-pre_3rdparty_in1k-384px_20221220-eb012e87.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_384.openai_ft_in1k
+ - Name: vit-base-p16_clip-laion2b-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 49370078208
+ Parameters: 86568424
+ Training Data:
+ - LAION-2B
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.52
+ Top 5 Accuracy: 97.97
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-pre_3rdparty_in1k-384px_20221220-558ed826.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_384.laion2b_ft_in1k
+ - Name: vit-base-p16_clip-openai-in12k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 49370078208
+ Parameters: 86568424
+ Training Data:
+ - OpenAI
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.87
+ Top 5 Accuracy: 98.05
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-in12k-pre_3rdparty_in1k-384px_20221220-8df86b74.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_384.openai_ft_in12k_in1k
+ - Name: vit-base-p16_clip-laion2b-in12k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 49370078208
+ Parameters: 86568424
+ Training Data:
+ - LAION-2B
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.17
+ Top 5 Accuracy: 98.02
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-in12k-pre_3rdparty_in1k-384px_20221220-84ed0cc0.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_384.laion2b_ft_in12k_in1k
+ - Name: vit-large-p14_clip-openai-pre_3rdparty
+ Metadata:
+ FLOPs: 59696580608
+ Parameters: 303302656
+ Training Data:
+ - OpenAI
+ In Collection: CLIP
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth
+ Config: configs/clip/vit-large-p14_headless.py
+ Converted From:
+ Code: https://github.com/mlfoundations/open_clip
+ Weights: https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt
diff --git a/configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py b/configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..14046ce3e40cce46944ccc0ddef6c884c38d9c89
--- /dev/null
+++ b/configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(pre_norm=True))
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=384,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/clip/vit-base-p16_pt-64xb64_in1k-448px.py b/configs/clip/vit-base-p16_pt-64xb64_in1k-448px.py
new file mode 100644
index 0000000000000000000000000000000000000000..02af585753074f3a831188a01085917eb04dad4b
--- /dev/null
+++ b/configs/clip/vit-base-p16_pt-64xb64_in1k-448px.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(pre_norm=True))
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=448,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=448,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=448),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/clip/vit-base-p16_pt-64xb64_in1k.py b/configs/clip/vit-base-p16_pt-64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cd018bac622744bdcf6cd50821612a9148c4a85d
--- /dev/null
+++ b/configs/clip/vit-base-p16_pt-64xb64_in1k.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
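+# CLIP ViTs apply an extra LayerNorm to the patch embeddings before the
+# transformer encoder (`ln_pre` in the original implementation), which is what
+# `pre_norm=True` enables on top of the base ViT-B/16 model.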
+model = dict(backbone=dict(pre_norm=True))
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py b/configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..d1acf78ab6bf335cc0e3cd1012fbe7773336c61e
--- /dev/null
+++ b/configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/vit-base-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(pre_norm=True))
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=384,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/clip/vit-base-p32_pt-64xb64_in1k-448px.py b/configs/clip/vit-base-p32_pt-64xb64_in1k-448px.py
new file mode 100644
index 0000000000000000000000000000000000000000..0f50391f15bb1dc60b94d5ef163f4e88e3b4e509
--- /dev/null
+++ b/configs/clip/vit-base-p32_pt-64xb64_in1k-448px.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/vit-base-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(pre_norm=True))
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=448,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=448,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=448),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/clip/vit-base-p32_pt-64xb64_in1k.py b/configs/clip/vit-base-p32_pt-64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..abbb50089edb9057504e7571bd29fddaa1c53dc9
--- /dev/null
+++ b/configs/clip/vit-base-p32_pt-64xb64_in1k.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/vit-base-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(pre_norm=True))
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/clip/vit-large-p14_headless.py b/configs/clip/vit-large-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9b965d4f0edc4794b05a3ea6a917a0d350a27f3
--- /dev/null
+++ b/configs/clip/vit-large-p14_headless.py
@@ -0,0 +1,34 @@
+_base_ = ['../_base_/default_runtime.py']
+
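+# Headless setting: only the vision backbone is defined (no classification head)
+# and `test_evaluator` is None, so this config is intended for extracting image
+# features from converted CLIP ViT-L weights rather than for evaluation.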
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='l',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.1,
+ pre_norm=True,
+ ),
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+test_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type='ImageNet',
+ data_root='data/imagenet',
+ ann_file='meta/val.txt',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = None
diff --git a/configs/conformer/README.md b/configs/conformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..04b5d4770b22c346a149dfc0bf7c1dfc2713a2a6
--- /dev/null
+++ b/configs/conformer/README.md
@@ -0,0 +1,84 @@
+# Conformer
+
+> [Conformer: Local Features Coupling Global Representations for Visual Recognition](https://arxiv.org/abs/2105.03889)
+
+
+
+## Abstract
+
+Within Convolutional Neural Network (CNN), the convolution operations are good at extracting local features but experience difficulty to capture global representations. Within visual transformer, the cascaded self-attention modules can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Conformer roots in the Feature Coupling Unit (FCU), which fuses local features and global representations under different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under the comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAPs for object detection and instance segmentation, respectively, demonstrating the great potential to be a general backbone network.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('conformer-tiny-p16_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('conformer-tiny-p16_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/conformer/conformer-small-p32_8xb128_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/conformer/conformer-tiny-p16_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/conformer/conformer-tiny-p16_3rdparty_8xb128_in1k_20211206-f6860372.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------: | :--------------------------------------------------------------------: |
+| `conformer-tiny-p16_3rdparty_in1k`\* | From scratch | 23.52 | 4.90 | 81.31 | 95.60 | [config](conformer-tiny-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-tiny-p16_3rdparty_8xb128_in1k_20211206-f6860372.pth) |
+| `conformer-small-p16_3rdparty_in1k`\* | From scratch | 37.67 | 10.31 | 83.32 | 96.46 | [config](conformer-small-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-small-p16_3rdparty_8xb128_in1k_20211206-3065dcf5.pth) |
+| `conformer-small-p32_8xb128_in1k` | From scratch | 38.85 | 7.09 | 81.96 | 96.02 | [config](conformer-small-p32_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-small-p32_8xb128_in1k_20211206-947a0816.pth) |
+| `conformer-base-p16_3rdparty_in1k`\* | From scratch | 83.29 | 22.89 | 83.82 | 96.59 | [config](conformer-base-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-base-p16_3rdparty_8xb128_in1k_20211206-bfdf8637.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/pengzhiliang/Conformer/blob/main/models.py#L89). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{peng2021conformer,
+ title={Conformer: Local Features Coupling Global Representations for Visual Recognition},
+ author={Zhiliang Peng and Wei Huang and Shanzhi Gu and Lingxi Xie and Yaowei Wang and Jianbin Jiao and Qixiang Ye},
+ journal={arXiv preprint arXiv:2105.03889},
+ year={2021},
+}
+```
diff --git a/configs/conformer/conformer-base-p16_8xb128_in1k.py b/configs/conformer/conformer-base-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a44f56f3ac3213c616a6e960ce2476466eb65bbd
--- /dev/null
+++ b/configs/conformer/conformer-base-p16_8xb128_in1k.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/conformer/base-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_conformer.py',
+ '../_base_/default_runtime.py'
+]
+
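+# Raise the per-GPU batch size from the base dataset setting (64) to 128,
+# matching the 8xb128 setup in the config name.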
+train_dataloader = dict(batch_size=128)
diff --git a/configs/conformer/conformer-small-p16_8xb128_in1k.py b/configs/conformer/conformer-small-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a937f4f9e60c3987a6ff3d2b7320a0dd49855cbc
--- /dev/null
+++ b/configs/conformer/conformer-small-p16_8xb128_in1k.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/conformer/small-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_conformer.py',
+ '../_base_/default_runtime.py'
+]
+
+train_dataloader = dict(batch_size=128)
diff --git a/configs/conformer/conformer-small-p32_8xb128_in1k.py b/configs/conformer/conformer-small-p32_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0b07ce2ce3fba146675b7a8453cc581f2a011db1
--- /dev/null
+++ b/configs/conformer/conformer-small-p32_8xb128_in1k.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/conformer/small-p32.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_conformer.py',
+ '../_base_/default_runtime.py'
+]
+
+train_dataloader = dict(batch_size=128)
diff --git a/configs/conformer/conformer-tiny-p16_8xb128_in1k.py b/configs/conformer/conformer-tiny-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f88c6c3b0da3c50e0b3ccb2454b200dfbaf7c4c7
--- /dev/null
+++ b/configs/conformer/conformer-tiny-p16_8xb128_in1k.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/conformer/tiny-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_conformer.py',
+ '../_base_/default_runtime.py'
+]
+
+train_dataloader = dict(batch_size=128)
diff --git a/configs/conformer/metafile.yml b/configs/conformer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..c0821bad059c32db978f02c4935a41ec0c054c16
--- /dev/null
+++ b/configs/conformer/metafile.yml
@@ -0,0 +1,78 @@
+Collections:
+ - Name: Conformer
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Layer Normalization
+ - Scaled Dot-Product Attention
+ - Dropout
+ Paper:
+ URL: https://arxiv.org/abs/2105.03889
+ Title: "Conformer: Local Features Coupling Global Representations for Visual Recognition"
+ README: configs/conformer/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.19.0/mmcls/models/backbones/conformer.py
+ Version: v0.19.0
+
+Models:
+ - Name: conformer-tiny-p16_3rdparty_in1k
+ In Collection: Conformer
+ Config: configs/conformer/conformer-tiny-p16_8xb128_in1k.py
+ Metadata:
+ FLOPs: 4899611328
+ Parameters: 23524704
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.31
+ Top 5 Accuracy: 95.60
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/conformer/conformer-tiny-p16_3rdparty_8xb128_in1k_20211206-f6860372.pth
+ Converted From:
+ Weights: https://drive.google.com/file/d/19SxGhKcWOR5oQSxNUWUM2MGYiaWMrF1z/view?usp=sharing
+ Code: https://github.com/pengzhiliang/Conformer/blob/main/models.py#L65
+ - Name: conformer-small-p16_3rdparty_in1k
+ In Collection: Conformer
+ Config: configs/conformer/conformer-small-p16_8xb128_in1k.py
+ Metadata:
+ FLOPs: 10311309312
+ Parameters: 37673424
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.32
+ Top 5 Accuracy: 96.46
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/conformer/conformer-small-p16_3rdparty_8xb128_in1k_20211206-3065dcf5.pth
+ Converted From:
+ Weights: https://drive.google.com/file/d/1mpOlbLaVxOfEwV4-ha78j_1Ebqzj2B83/view?usp=sharing
+ Code: https://github.com/pengzhiliang/Conformer/blob/main/models.py#L73
+ - Name: conformer-small-p32_8xb128_in1k
+ In Collection: Conformer
+ Config: configs/conformer/conformer-small-p32_8xb128_in1k.py
+ Metadata:
+ FLOPs: 7087281792
+ Parameters: 38853072
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.96
+ Top 5 Accuracy: 96.02
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/conformer/conformer-small-p32_8xb128_in1k_20211206-947a0816.pth
+ - Name: conformer-base-p16_3rdparty_in1k
+ In Collection: Conformer
+ Config: configs/conformer/conformer-base-p16_8xb128_in1k.py
+ Metadata:
+ FLOPs: 22892078080
+ Parameters: 83289136
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.82
+ Top 5 Accuracy: 96.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/conformer/conformer-base-p16_3rdparty_8xb128_in1k_20211206-bfdf8637.pth
+ Converted From:
+ Weights: https://drive.google.com/file/d/1oeQ9LSOGKEUaYGu7WTlUGl3KDsQIi0MA/view?usp=sharing
+ Code: https://github.com/pengzhiliang/Conformer/blob/main/models.py#L89
diff --git a/configs/convmixer/README.md b/configs/convmixer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a87d27ffb8ec0dd6a6182d99133a227b0b29945b
--- /dev/null
+++ b/configs/convmixer/README.md
@@ -0,0 +1,79 @@
+# ConvMixer
+
+> [Patches Are All You Need?](https://arxiv.org/abs/2201.09792)
+
+
+
+## Abstract
+
+Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('convmixer-768-32_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('convmixer-768-32_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/convmixer/convmixer-768-32_10xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-768-32_3rdparty_10xb64_in1k_20220323-bca1f7b8.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :------------------------------------------------------------------------: |
+| `convmixer-768-32_3rdparty_in1k`\* | From scratch | 21.11 | 19.62 | 80.16 | 95.08 | [config](convmixer-768-32_10xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-768-32_3rdparty_10xb64_in1k_20220323-bca1f7b8.pth) |
+| `convmixer-1024-20_3rdparty_in1k`\* | From scratch | 24.38 | 5.55 | 76.94 | 93.36 | [config](convmixer-1024-20_10xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-1024-20_3rdparty_10xb64_in1k_20220323-48f8aeba.pth) |
+| `convmixer-1536-20_3rdparty_in1k`\* | From scratch | 51.63 | 48.71 | 81.37 | 95.61 | [config](convmixer-1536-20_10xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-1536_20_3rdparty_10xb64_in1k_20220323-ea5786f3.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/locuslab/convmixer). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@misc{trockman2022patches,
+ title={Patches Are All You Need?},
+ author={Asher Trockman and J. Zico Kolter},
+ year={2022},
+ eprint={2201.09792},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/convmixer/convmixer-1024-20_10xb64_in1k.py b/configs/convmixer/convmixer-1024-20_10xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0dbc664261e2244cb35a779211c45b5b854d4cc5
--- /dev/null
+++ b/configs/convmixer/convmixer-1024-20_10xb64_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/models/convmixer/convmixer-1024-20.py',
+ '../_base_/datasets/imagenet_bs64_convmixer_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=0.01),
+ clip_grad=dict(max_norm=5.0),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=130,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=20,
+ end=150)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=150)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (10 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=640)
diff --git a/configs/convmixer/convmixer-1536-20_10xb64_in1k.py b/configs/convmixer/convmixer-1536-20_10xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3c8cc95c20312311ee06cee911dc186944de5b7f
--- /dev/null
+++ b/configs/convmixer/convmixer-1536-20_10xb64_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/models/convmixer/convmixer-1536-20.py',
+ '../_base_/datasets/imagenet_bs64_convmixer_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=0.01),
+ clip_grad=dict(max_norm=5.0),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=130,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=20,
+ end=150)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=150)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (10 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=640)
diff --git a/configs/convmixer/convmixer-768-32_10xb64_in1k.py b/configs/convmixer/convmixer-768-32_10xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d872d4429134ef8c88ea87da3c93b6532472423e
--- /dev/null
+++ b/configs/convmixer/convmixer-768-32_10xb64_in1k.py
@@ -0,0 +1,19 @@
+_base_ = [
+ '../_base_/models/convmixer/convmixer-768-32.py',
+ '../_base_/datasets/imagenet_bs64_convmixer_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=0.01),
+ clip_grad=dict(max_norm=5.0),
+)
+
+train_cfg = dict(by_epoch=True, max_epochs=300)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (10 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=640)
diff --git a/configs/convmixer/metafile.yml b/configs/convmixer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..f9dcdc7cc71ddc72791ab47666c0a35d30a9f349
--- /dev/null
+++ b/configs/convmixer/metafile.yml
@@ -0,0 +1,61 @@
+Collections:
+ - Name: ConvMixer
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - 1x1 Convolution
+ - LayerScale
+ Paper:
+ URL: https://arxiv.org/abs/2201.09792
+ Title: Patches Are All You Need?
+ README: configs/convmixer/README.md
+
+Models:
+ - Name: convmixer-768-32_3rdparty_in1k
+ Metadata:
+ FLOPs: 19623051264
+ Parameters: 21110248
+ In Collection: ConvMixer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.16
+ Top 5 Accuracy: 95.08
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-768-32_3rdparty_10xb64_in1k_20220323-bca1f7b8.pth
+ Config: configs/convmixer/convmixer-768-32_10xb64_in1k.py
+ Converted From:
+ Weights: https://github.com/tmp-iclr/convmixer/releases/download/v1.0/convmixer_768_32_ks7_p7_relu.pth.tar
+ Code: https://github.com/locuslab/convmixer
+ - Name: convmixer-1024-20_3rdparty_in1k
+ Metadata:
+ FLOPs: 5550112768
+ Parameters: 24383464
+ In Collection: ConvMixer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.94
+ Top 5 Accuracy: 93.36
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-1024-20_3rdparty_10xb64_in1k_20220323-48f8aeba.pth
+ Config: configs/convmixer/convmixer-1024-20_10xb64_in1k.py
+ Converted From:
+ Weights: https://github.com/tmp-iclr/convmixer/releases/download/v1.0/convmixer_1024_20_ks9_p14.pth.tar
+ Code: https://github.com/locuslab/convmixer
+ - Name: convmixer-1536-20_3rdparty_in1k
+ Metadata:
+ FLOPs: 48713170944
+ Parameters: 51625960
+ In Collection: ConvMixer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.37
+ Top 5 Accuracy: 95.61
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-1536_20_3rdparty_10xb64_in1k_20220323-ea5786f3.pth
+ Config: configs/convmixer/convmixer-1536-20_10xb64_in1k.py
+ Converted From:
+ Weights: https://github.com/tmp-iclr/convmixer/releases/download/v1.0/convmixer_1536_20_ks9_p7.pth.tar
+ Code: https://github.com/locuslab/convmixer
diff --git a/configs/convnext/README.md b/configs/convnext/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2e6e14c2f2e65af68c1f8177bdec91f70a0b3149
--- /dev/null
+++ b/configs/convnext/README.md
@@ -0,0 +1,123 @@
+# ConvNeXt
+
+> [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545v1)
+
+
+
+## Introduction
+
+**ConvNeXt**, initially described in [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545v1), is a pure convolutional model (ConvNet) inspired by the design of Vision Transformers. ConvNeXt adopts a pyramid structure and achieves competitive performance on various vision tasks while keeping the simplicity and efficiency of standard ConvNets.
+
+
+
+
+
+## Abstract
+
+
+
+
+
+The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('convnext-tiny_32xb128_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('convnext-tiny_32xb128_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/convnext/convnext-tiny_32xb128_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/convnext/convnext-tiny_32xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------- | :--------: | :-------: | :---------------------------------------: | :--------------------------------------------------------------------------------------------------------: |
+| `convnext-base_3rdparty_in21k`\* | 88.59 | 15.36 | [config](convnext-base_32xb128_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_in21k_20220124-13b83eec.pth) |
+| `convnext-large_3rdparty_in21k`\* | 197.77 | 34.37 | [config](convnext-large_64xb64_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_in21k_20220124-41b5a79f.pth) |
+| `convnext-xlarge_3rdparty_in21k`\* | 350.20 | 60.93 | [config](convnext-xlarge_64xb64_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_3rdparty_in21k_20220124-f909bad7.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/ConvNeXt). The config files of these models are only for inference. We haven't reproduced the training results.*
+
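+The ImageNet-21k checkpoints above are released without ImageNet-1k results and are mainly useful as initialization weights or feature extractors. A minimal sketch, following the usage pattern shown earlier in this README (assuming the model name from the table above is registered):
+
+```python
+import torch
+from mmpretrain import get_model
+
+# Load the ImageNet-21k pre-trained ConvNeXt-B and extract backbone features.
+model = get_model('convnext-base_3rdparty_in21k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+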
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------------: | :------------------------------------------------------: |
+| `convnext-tiny_32xb128_in1k` | From scratch | 28.59 | 4.46 | 82.14 | 96.06 | [config](convnext-tiny_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.json) |
+| `convnext-tiny_32xb128-noema_in1k` | From scratch | 28.59 | 4.46 | 81.95 | 95.89 | [config](convnext-tiny_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128-noema_in1k_20221208-5d4509c7.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.json) |
+| `convnext-tiny_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 28.59 | 4.46 | 82.90 | 96.62 | [config](convnext-tiny_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_in21k-pre_3rdparty_in1k_20221219-7501e534.pth) |
+| `convnext-tiny_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 28.59 | 13.14 | 84.11 | 97.14 | [config](convnext-tiny_32xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_in21k-pre_3rdparty_in1k-384px_20221219-c1182362.pth) |
+| `convnext-small_32xb128_in1k` | From scratch | 50.22 | 8.69 | 83.16 | 96.56 | [config](convnext-small_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128_in1k_20221207-4ab7052c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128_in1k_20221207-4ab7052c.json) |
+| `convnext-small_32xb128-noema_in1k` | From scratch | 50.22 | 8.69 | 83.21 | 96.48 | [config](convnext-small_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128-noema_in1k_20221208-4a618995.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128_in1k_20221207-4ab7052c.json) |
+| `convnext-small_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 50.22 | 8.69 | 84.59 | 97.41 | [config](convnext-small_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_in21k-pre_3rdparty_in1k_20221219-aeca4c93.pth) |
+| `convnext-small_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 50.22 | 25.58 | 85.75 | 97.88 | [config](convnext-small_32xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_in21k-pre_3rdparty_in1k-384px_20221219-96f0bb87.pth) |
+| `convnext-base_32xb128_in1k` | From scratch | 88.59 | 15.36 | 83.66 | 96.74 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128_in1k_20221207-fbdb5eb9.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128_in1k_20221207-fbdb5eb9.json) |
+| `convnext-base_32xb128-noema_in1k` | From scratch | 88.59 | 15.36 | 83.64 | 96.61 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128-noema_in1k_20221208-f8182678.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128_in1k_20221207-fbdb5eb9.json) |
+| `convnext-base_3rdparty_in1k`\* | From scratch | 88.59 | 15.36 | 83.85 | 96.74 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128_in1k_20220124-d0915162.pth) |
+| `convnext-base_3rdparty-noema_in1k`\* | From scratch | 88.59 | 15.36 | 83.71 | 96.60 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128-noema_in1k_20220222-dba4f95f.pth) |
+| `convnext-base_3rdparty_in1k-384px`\* | From scratch | 88.59 | 45.21 | 85.10 | 97.34 | [config](convnext-base_32xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_in1k-384px_20221219-c8f1dc2b.pth) |
+| `convnext-base_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 88.59 | 15.36 | 85.81 | 97.86 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_32xb128_in1k_20220124-eb2d6ada.pth) |
+| `convnext-base_in21k-pre-3rdparty_in1k-384px`\*   | ImageNet-21k | 88.59      | 45.21     | 86.82     | 98.25     | [config](convnext-base_32xb128_in1k-384px.py)  | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_in1k-384px_20221219-4570f792.pth) |
+| `convnext-large_3rdparty_in1k`\* | From scratch | 197.77 | 34.37 | 84.30 | 96.89 | [config](convnext-large_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_64xb64_in1k_20220124-f8a0ded0.pth) |
+| `convnext-large_3rdparty_in1k-384px`\* | From scratch | 197.77 | 101.10 | 85.50 | 97.59 | [config](convnext-large_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_in1k-384px_20221219-6dd29d10.pth) |
+| `convnext-large_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 197.77 | 34.37 | 86.61 | 98.04 | [config](convnext-large_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_in21k-pre-3rdparty_64xb64_in1k_20220124-2412403d.pth) |
+| `convnext-large_in21k-pre-3rdparty_in1k-384px`\* | ImageNet-21k | 197.77 | 101.10 | 87.46 | 98.37 | [config](convnext-large_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_in21k-pre-3rdparty_in1k-384px_20221219-6d38dd66.pth) |
+| `convnext-xlarge_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 350.20 | 60.93 | 86.97 | 98.20 | [config](convnext-xlarge_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_in21k-pre-3rdparty_64xb64_in1k_20220124-76b6863d.pth) |
+| `convnext-xlarge_in21k-pre-3rdparty_in1k-384px`\* | ImageNet-21k | 350.20 | 179.20 | 87.76 | 98.55 | [config](convnext-xlarge_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_in21k-pre-3rdparty_in1k-384px_20221219-b161bc14.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/ConvNeXt). The config files of these models are only for inference. We haven't reproduced the training results.*
+
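+To evaluate one of these checkpoints, pass its config and the weights URL from the table to the standard test script, as in the sketch below (shown here for the `convnext-base_3rdparty_in1k` row; substitute the config and URL of any other row):
+
+```shell
+python tools/test.py configs/convnext/convnext-base_32xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128_in1k_20220124-d0915162.pth
+```
+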
+## Citation
+
+```bibtex
+@Article{liu2022convnet,
+ author = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
+ title = {A ConvNet for the 2020s},
+ journal = {arXiv preprint arXiv:2201.03545},
+ year = {2022},
+}
+```
diff --git a/configs/convnext/convnext-base_32xb128_in1k-384px.py b/configs/convnext/convnext-base_32xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..65546942562ac17b3d4510c78d3090aa8b87a831
--- /dev/null
+++ b/configs/convnext/convnext-base_32xb128_in1k-384px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-base.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-base_32xb128_in1k.py b/configs/convnext/convnext-base_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5ae8ec47c4c7ac3f22712c97dbad315c7a798e6f
--- /dev/null
+++ b/configs/convnext/convnext-base_32xb128_in1k.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=None,
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-base_32xb128_in21k.py b/configs/convnext/convnext-base_32xb128_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c343526c7f084501fc3651c1581752209f5019a4
--- /dev/null
+++ b/configs/convnext/convnext-base_32xb128_in21k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-base.py',
+ '../_base_/datasets/imagenet21k_bs128.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# model setting
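+# ImageNet-21k has 21841 classes.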
+model = dict(head=dict(num_classes=21841))
+
+# dataset setting
+data_preprocessor = dict(num_classes=21841)
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-large_64xb64_in1k-384px.py b/configs/convnext/convnext-large_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..6698b9edcdae463d6d1cf943237efbaf236cd71c
--- /dev/null
+++ b/configs/convnext/convnext-large_64xb64_in1k-384px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-large.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-large_64xb64_in1k.py b/configs/convnext/convnext-large_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8a78c58bc3d85e0e08083d339378886f870388bc
--- /dev/null
+++ b/configs/convnext/convnext-large_64xb64_in1k.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-large.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=None,
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-large_64xb64_in21k.py b/configs/convnext/convnext-large_64xb64_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..420edab67b1dc094f08b4a3810af522b2a988b62
--- /dev/null
+++ b/configs/convnext/convnext-large_64xb64_in21k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-large.py',
+ '../_base_/datasets/imagenet21k_bs128.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# model setting
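+# ImageNet-21k has 21841 classes.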
+model = dict(head=dict(num_classes=21841))
+
+# dataset setting
+data_preprocessor = dict(num_classes=21841)
+train_dataloader = dict(batch_size=64)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-small_32xb128_in1k-384px.py b/configs/convnext/convnext-small_32xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..729f00ad2fdf53943ffae9de165e2e9985e733c7
--- /dev/null
+++ b/configs/convnext/convnext-small_32xb128_in1k-384px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-small.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-small_32xb128_in1k.py b/configs/convnext/convnext-small_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b623e900f830fbea7891b61c737398c0dee1076e
--- /dev/null
+++ b/configs/convnext/convnext-small_32xb128_in1k.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-small.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=None,
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-tiny_32xb128_in1k-384px.py b/configs/convnext/convnext-tiny_32xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..6513ad8dfa41714ecb5c9de5992496716337c596
--- /dev/null
+++ b/configs/convnext/convnext-tiny_32xb128_in1k-384px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-tiny_32xb128_in1k.py b/configs/convnext/convnext-tiny_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..59d3004bde89510b5c44110c8a6513957c0cbba0
--- /dev/null
+++ b/configs/convnext/convnext-tiny_32xb128_in1k.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=None,
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-xlarge_64xb64_in1k-384px.py b/configs/convnext/convnext-xlarge_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..6edc94d2448157fc82bf38a988bf4393f192a89f
--- /dev/null
+++ b/configs/convnext/convnext-xlarge_64xb64_in1k-384px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-xlarge.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-xlarge_64xb64_in1k.py b/configs/convnext/convnext-xlarge_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..528894e808b7085ee66d8be89cf84f860ddec979
--- /dev/null
+++ b/configs/convnext/convnext-xlarge_64xb64_in1k.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-xlarge.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=None,
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-xlarge_64xb64_in21k.py b/configs/convnext/convnext-xlarge_64xb64_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..420edab67b1dc094f08b4a3810af522b2a988b62
--- /dev/null
+++ b/configs/convnext/convnext-xlarge_64xb64_in21k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-xlarge.py',
+ '../_base_/datasets/imagenet21k_bs128.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# model setting
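+# ImageNet-21k has 21841 classes.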
+model = dict(head=dict(num_classes=21841))
+
+# dataset setting
+data_preprocessor = dict(num_classes=21841)
+train_dataloader = dict(batch_size=64)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/metafile.yml b/configs/convnext/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..16896629f07ffadd5313a6e38bc1532ddc3c08f2
--- /dev/null
+++ b/configs/convnext/metafile.yml
@@ -0,0 +1,410 @@
+Collections:
+ - Name: ConvNeXt
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - 1x1 Convolution
+ - LayerScale
+ Paper:
+ URL: https://arxiv.org/abs/2201.03545v1
+ Title: A ConvNet for the 2020s
+ README: configs/convnext/README.md
+ Code:
+ Version: v0.20.1
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/convnext.py
+
+Models:
+ - Name: convnext-tiny_32xb128_in1k
+ Metadata:
+ FLOPs: 4457472768
+ Parameters: 28589128
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.14
+ Top 5 Accuracy: 96.06
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.pth
+ Config: configs/convnext/convnext-tiny_32xb128_in1k.py
+ - Name: convnext-tiny_32xb128-noema_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 4457472768
+ Parameters: 28589128
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.95
+ Top 5 Accuracy: 95.89
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128-noema_in1k_20221208-5d4509c7.pth
+ Config: configs/convnext/convnext-tiny_32xb128_in1k.py
+ - Name: convnext-tiny_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 4457472768
+ Parameters: 28589128
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.90
+ Top 5 Accuracy: 96.62
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_in21k-pre_3rdparty_in1k_20221219-7501e534.pth
+ Config: configs/convnext/convnext-tiny_32xb128_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_tiny_22k_1k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-tiny_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 13135236864
+ Parameters: 28589128
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.11
+ Top 5 Accuracy: 97.14
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_in21k-pre_3rdparty_in1k-384px_20221219-c1182362.pth
+ Config: configs/convnext/convnext-tiny_32xb128_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_tiny_22k_1k_384.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-small_32xb128_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 8687008512
+ Parameters: 50223688
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.16
+ Top 5 Accuracy: 96.56
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128_in1k_20221207-4ab7052c.pth
+ Config: configs/convnext/convnext-small_32xb128_in1k.py
+ - Name: convnext-small_32xb128-noema_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 8687008512
+ Parameters: 50223688
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.21
+ Top 5 Accuracy: 96.48
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128-noema_in1k_20221208-4a618995.pth
+ Config: configs/convnext/convnext-small_32xb128_in1k.py
+ - Name: convnext-small_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 8687008512
+ Parameters: 50223688
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.59
+ Top 5 Accuracy: 97.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_in21k-pre_3rdparty_in1k_20221219-aeca4c93.pth
+ Config: configs/convnext/convnext-small_32xb128_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_small_22k_1k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-small_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 25580818176
+ Parameters: 50223688
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.75
+ Top 5 Accuracy: 97.88
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_in21k-pre_3rdparty_in1k-384px_20221219-96f0bb87.pth
+ Config: configs/convnext/convnext-small_32xb128_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_small_22k_1k_384.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-base_32xb128_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 15359124480
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.66
+ Top 5 Accuracy: 96.74
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128_in1k_20221207-fbdb5eb9.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k.py
+ - Name: convnext-base_32xb128-noema_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 15359124480
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.64
+ Top 5 Accuracy: 96.61
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128-noema_in1k_20221208-f8182678.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k.py
+ - Name: convnext-base_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 15359124480
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.85
+ Top 5 Accuracy: 96.74
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128_in1k_20220124-d0915162.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_224_ema.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-base_3rdparty-noema_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 15359124480
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.71
+ Top 5 Accuracy: 96.60
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128-noema_in1k_20220222-dba4f95f.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-base_3rdparty_in1k-384px
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 45205885952
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.10
+ Top 5 Accuracy: 97.34
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_in1k-384px_20221219-c8f1dc2b.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_384.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-base_3rdparty_in21k
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 15359124480
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_in21k_20220124-13b83eec.pth
+ Config: configs/convnext/convnext-base_32xb128_in21k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_22k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-base_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 15359124480
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.81
+ Top 5 Accuracy: 97.86
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_32xb128_in1k_20220124-eb2d6ada.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_22k_1k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-base_in21k-pre-3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 45205885952
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.82
+ Top 5 Accuracy: 98.25
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_in1k-384px_20221219-4570f792.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_22k_1k_384.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-large_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 34368026112
+ Parameters: 197767336
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.30
+ Top 5 Accuracy: 96.89
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_64xb64_in1k_20220124-f8a0ded0.pth
+ Config: configs/convnext/convnext-large_64xb64_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_1k_224_ema.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-large_3rdparty_in1k-384px
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 101103214080
+ Parameters: 197767336
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.50
+ Top 5 Accuracy: 97.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_in1k-384px_20221219-6dd29d10.pth
+ Config: configs/convnext/convnext-large_64xb64_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_1k_384.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-large_3rdparty_in21k
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 34368026112
+ Parameters: 197767336
+ In Collection: ConvNeXt
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_in21k_20220124-41b5a79f.pth
+ Config: configs/convnext/convnext-large_64xb64_in21k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_22k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-large_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 34368026112
+ Parameters: 197767336
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.61
+ Top 5 Accuracy: 98.04
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_in21k-pre-3rdparty_64xb64_in1k_20220124-2412403d.pth
+ Config: configs/convnext/convnext-large_64xb64_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_22k_1k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-large_in21k-pre-3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 101103214080
+ Parameters: 197767336
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.46
+ Top 5 Accuracy: 98.37
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_in21k-pre-3rdparty_in1k-384px_20221219-6d38dd66.pth
+ Config: configs/convnext/convnext-large_64xb64_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_22k_1k_384.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-xlarge_3rdparty_in21k
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 60929820672
+ Parameters: 350196968
+ In Collection: ConvNeXt
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_3rdparty_in21k_20220124-f909bad7.pth
+ Config: configs/convnext/convnext-xlarge_64xb64_in21k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_xlarge_22k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-xlarge_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 60929820672
+ Parameters: 350196968
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.97
+ Top 5 Accuracy: 98.20
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_in21k-pre-3rdparty_64xb64_in1k_20220124-76b6863d.pth
+ Config: configs/convnext/convnext-xlarge_64xb64_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_xlarge_22k_1k_224_ema.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-xlarge_in21k-pre-3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 179196798976
+ Parameters: 350196968
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.76
+ Top 5 Accuracy: 98.55
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_in21k-pre-3rdparty_in1k-384px_20221219-b161bc14.pth
+ Config: configs/convnext/convnext-xlarge_64xb64_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_xlarge_22k_1k_384_ema.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
diff --git a/configs/convnext_v2/README.md b/configs/convnext_v2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e561387412aa3a8e088cb7d015e7b98dba8e50c1
--- /dev/null
+++ b/configs/convnext_v2/README.md
@@ -0,0 +1,107 @@
+# ConvNeXt V2
+
+> [Co-designing and Scaling ConvNets with Masked Autoencoders](http://arxiv.org/abs/2301.00808)
+
+
+
+## Abstract
+
+Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
+
+
+

+
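+The key architectural addition described above, Global Response Normalization (GRN), is compact enough to sketch directly. The snippet below is a minimal PyTorch rendition following the formulation in the paper (per-channel global L2 aggregation over the spatial dimensions, divisive normalization across channels, then a learnable calibration with a residual connection); the backbone implementation shipped in this repository may differ in details such as epsilon handling and parameter shapes.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class GRN(nn.Module):
+    """Global Response Normalization over channels-last (N, H, W, C) features."""
+
+    def __init__(self, channels: int, eps: float = 1e-6):
+        super().__init__()
+        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, channels))
+        self.beta = nn.Parameter(torch.zeros(1, 1, 1, channels))
+        self.eps = eps
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        # Global aggregation: per-channel L2 norm over the spatial dimensions.
+        gx = torch.linalg.vector_norm(x, ord=2, dim=(1, 2), keepdim=True)
+        # Divisive normalization across channels.
+        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)
+        # Learnable calibration plus a residual connection.
+        return self.gamma * (x * nx) + self.beta + x
+
+
+feat = torch.rand(2, 7, 7, 96)    # channels-last, as inside ConvNeXt blocks
+print(GRN(96)(feat).shape)        # torch.Size([2, 7, 7, 96])
+```
+
+In the paper, GRN is placed after the activation of the expansion MLP in each block, and LayerScale is removed as redundant.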
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('convnext-v2-atto_fcmae-pre_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('convnext-v2-atto_3rdparty-fcmae_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_fcmae-pre_3rdparty_in1k_20230104-23765f83.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :---------------------------------------- | :--------: | :-------: | :----------------------------------------: | :------------------------------------------------------------------------------------------------: |
+| `convnext-v2-atto_3rdparty-fcmae_in1k`\* | 3.71 | 0.55 | [config](convnext-v2-atto_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_3rdparty-fcmae_in1k_20230104-07514db4.pth) |
+| `convnext-v2-femto_3rdparty-fcmae_in1k`\* | 5.23 | 0.78 | [config](convnext-v2-femto_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-femto_3rdparty-fcmae_in1k_20230104-adbe2082.pth) |
+| `convnext-v2-pico_3rdparty-fcmae_in1k`\* | 9.07 | 1.37 | [config](convnext-v2-pico_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-pico_3rdparty-fcmae_in1k_20230104-147b1b59.pth) |
+| `convnext-v2-nano_3rdparty-fcmae_in1k`\* | 15.62 | 2.45 | [config](convnext-v2-nano_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_3rdparty-fcmae_in1k_20230104-3dd1f29e.pth) |
+| `convnext-v2-tiny_3rdparty-fcmae_in1k`\* | 28.64 | 4.47 | [config](convnext-v2-tiny_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_3rdparty-fcmae_in1k_20230104-80513adc.pth) |
+| `convnext-v2-base_3rdparty-fcmae_in1k`\* | 88.72 | 15.38 | [config](convnext-v2-base_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_3rdparty-fcmae_in1k_20230104-8a798eaf.pth) |
+| `convnext-v2-large_3rdparty-fcmae_in1k`\* | 197.96 | 34.40 | [config](convnext-v2-large_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_3rdparty-fcmae_in1k_20230104-bf38df92.pth) |
+| `convnext-v2-huge_3rdparty-fcmae_in1k`\* | 660.29 | 115.00 | [config](convnext-v2-huge_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_3rdparty-fcmae_in1k_20230104-fe43ae6c.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/ConvNeXt-V2). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------------: | :------------------------------------------------: |
+| `convnext-v2-atto_fcmae-pre_3rdparty_in1k`\* | FCMAE | 3.71 | 0.55 | 76.64 | 93.04 | [config](convnext-v2-atto_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_fcmae-pre_3rdparty_in1k_20230104-23765f83.pth) |
+| `convnext-v2-femto_fcmae-pre_3rdparty_in1k`\* | FCMAE | 5.23 | 0.78 | 78.48 | 93.98 | [config](convnext-v2-femto_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-femto_fcmae-pre_3rdparty_in1k_20230104-92a75d75.pth) |
+| `convnext-v2-pico_fcmae-pre_3rdparty_in1k`\* | FCMAE | 9.07 | 1.37 | 80.31 | 95.08 | [config](convnext-v2-pico_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-pico_fcmae-pre_3rdparty_in1k_20230104-d20263ca.pth) |
+| `convnext-v2-nano_fcmae-pre_3rdparty_in1k`\* | FCMAE | 15.62 | 2.45 | 81.86 | 95.75 | [config](convnext-v2-nano_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-pre_3rdparty_in1k_20230104-fe1aaaf2.pth) |
+| `convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k`\* | FCMAE ImageNet-21k | 15.62 | 2.45 | 82.04 | 96.16 | [config](convnext-v2-nano_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k_20230104-91fa8ae2.pth) |
+| `convnext-v2-tiny_fcmae-pre_3rdparty_in1k`\* | FCMAE | 28.64 | 4.47 | 82.94 | 96.29 | [config](convnext-v2-tiny_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-pre_3rdparty_in1k_20230104-471a86de.pth) |
+| `convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k`\* | FCMAE ImageNet-21k | 28.64 | 4.47 | 83.89 | 96.96 | [config](convnext-v2-tiny_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k_20230104-8cc8b8f2.pth) |
+| `convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 15.62 | 7.21 | 83.36 | 96.75 | [config](convnext-v2-nano_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-f951ae87.pth) |
+| `convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 28.64 | 13.14 | 85.09 | 97.63 | [config](convnext-v2-tiny_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-d8579f84.pth) |
+| `convnext-v2-base_fcmae-pre_3rdparty_in1k`\* | FCMAE | 88.72 | 15.38 | 84.87 | 97.08 | [config](convnext-v2-base_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-pre_3rdparty_in1k_20230104-00a70fa4.pth) |
+| `convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k`\* | FCMAE ImageNet-21k | 88.72 | 15.38 | 86.74 | 98.02 | [config](convnext-v2-base_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k_20230104-c48d16a5.pth) |
+| `convnext-v2-large_fcmae-pre_3rdparty_in1k`\* | FCMAE | 197.96 | 34.40 | 85.76 | 97.59 | [config](convnext-v2-large_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-pre_3rdparty_in1k_20230104-ef393013.pth) |
+| `convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k`\* | FCMAE ImageNet-21k | 197.96 | 34.40 | 87.26 | 98.24 | [config](convnext-v2-large_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k_20230104-d9c4dc0c.pth) |
+| `convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 88.72 | 45.21 | 87.63 | 98.42 | [config](convnext-v2-base_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-379425cc.pth) |
+| `convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 197.96 | 101.10 | 88.18 | 98.52 | [config](convnext-v2-large_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-9139a1f3.pth) |
+| `convnext-v2-huge_fcmae-pre_3rdparty_in1k`\* | FCMAE | 660.29 | 115.00 | 86.25 | 97.75 | [config](convnext-v2-huge_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-pre_3rdparty_in1k_20230104-f795e5b8.pth) |
+| `convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 660.29 | 337.96 | 88.68 | 98.73 | [config](convnext-v2-huge_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-02a4eb35.pth) |
+| `convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-512px`\* | FCMAE ImageNet-21k | 660.29 | 600.81 | 88.86 | 98.74 | [config](convnext-v2-huge_32xb32_in1k-512px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-512px_20230104-ce32e63c.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/ConvNeXt-V2). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{Woo2023ConvNeXtV2,
+ title={ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders},
+ author={Sanghyun Woo and Shoubhik Debnath and Ronghang Hu and Xinlei Chen and Zhuang Liu and In So Kweon and Saining Xie},
+ year={2023},
+ journal={arXiv preprint arXiv:2301.00808},
+}
+```
diff --git a/configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..68f34c9634e3390bb3c600351ef37e9a94c6d575
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext_v2/atto.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=8e-4, weight_decay=0.3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-base_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-base_32xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..70b7f18e0c9dfa92791ff1a8a77553680de673e7
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-base_32xb32_in1k-384px.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/base.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-base_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-base_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b66b375eb3a3872842b4fdf72285db36a76dc3b8
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-base_32xb32_in1k.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..053e19478fe75dac91b616fa314f4fbdd2667c61
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext_v2/femto.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=8e-4, weight_decay=0.3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..b734b271ef9a7ada6085c14465a43ee05841b348
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-384px.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/huge.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-512px.py b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-512px.py
new file mode 100644
index 0000000000000000000000000000000000000000..7c63b023be3cbcca94e0847ed88febfd1b099223
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-512px.py
@@ -0,0 +1,54 @@
+_base_ = [
+ '../_base_/models/convnext_v2/huge.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=512,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=512, backend='pillow', interpolation='bicubic'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=32, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..18621f3aeb86c1a8ad620d71625c2952ca145320
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/huge.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-large_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-large_32xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..b08b12eb0507b2582fe237b498c97f57452e29ec
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-large_32xb32_in1k-384px.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/large.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-large_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-large_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e9695d08e9c63bae6f440a427c07ddb68b08403b
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-large_32xb32_in1k.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/large.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-nano_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-nano_32xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..a9b36dc59229e0dba661211c3570771453f54113
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-nano_32xb32_in1k-384px.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext_v2/nano.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=8e-4, weight_decay=0.3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9a7c9e3e629522b42b9ff4d02a479b4688a74b92
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext_v2/nano.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=8e-4, weight_decay=0.3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e2cc52ff252972724d4d6737dda1e784abc4d536
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext_v2/pico.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=8e-4, weight_decay=0.3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..a19fd6cc670c33726187d41cef41ff33e69d8edd
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k-384px.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=3.2e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=40,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=40)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c6fbd0f2cd4189fb1699959cf8d63228a1ab3515
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=3.2e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=40,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=40)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/metafile.yml b/configs/convnext_v2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..86baa586ec6824603351cc70348c219f68fa71a2
--- /dev/null
+++ b/configs/convnext_v2/metafile.yml
@@ -0,0 +1,433 @@
+Collections:
+ - Name: ConvNeXt V2
+ Metadata:
+ Architecture:
+ - Global Response Normalization
+ Paper:
+ Title: 'ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders'
+ URL: http://arxiv.org/abs/2301.00808
+ README: configs/convnext_v2/README.md
+
+Models:
+ - Name: convnext-v2-atto_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 551718080
+ Parameters: 3708400
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_3rdparty-fcmae_in1k_20230104-07514db4.pth
+ Config: configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_atto_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-atto_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 551718080
+ Parameters: 3708400
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.64
+ Top 5 Accuracy: 93.04
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_fcmae-pre_3rdparty_in1k_20230104-23765f83.pth
+ Config: configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_atto_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-femto_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 784892544
+ Parameters: 5233240
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-femto_3rdparty-fcmae_in1k_20230104-adbe2082.pth
+ Config: configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_femto_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-femto_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 784892544
+ Parameters: 5233240
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.48
+ Top 5 Accuracy: 93.98
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-femto_fcmae-pre_3rdparty_in1k_20230104-92a75d75.pth
+ Config: configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_femto_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-pico_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 1374072320
+ Parameters: 9066280
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-pico_3rdparty-fcmae_in1k_20230104-147b1b59.pth
+ Config: configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_pico_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-pico_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 1374072320
+ Parameters: 9066280
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.31
+ Top 5 Accuracy: 95.08
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-pico_fcmae-pre_3rdparty_in1k_20230104-d20263ca.pth
+ Config: configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_pico_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-nano_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 2454926720
+ Parameters: 15623800
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_3rdparty-fcmae_in1k_20230104-3dd1f29e.pth
+ Config: configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_nano_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-nano_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 2454926720
+ Parameters: 15623800
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.86
+ Top 5 Accuracy: 95.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-pre_3rdparty_in1k_20230104-fe1aaaf2.pth
+ Config: configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_nano_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 2454926720
+ Parameters: 15623800
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.04
+ Top 5 Accuracy: 96.16
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k_20230104-91fa8ae2.pth
+ Config: configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_nano_22k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-tiny_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 4469631744
+ Parameters: 28635496
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_3rdparty-fcmae_in1k_20230104-80513adc.pth
+ Config: configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_tiny_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-tiny_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 4469631744
+ Parameters: 28635496
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.94
+ Top 5 Accuracy: 96.29
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-pre_3rdparty_in1k_20230104-471a86de.pth
+ Config: configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_tiny_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 4469631744
+ Parameters: 28635496
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.89
+ Top 5 Accuracy: 96.96
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k_20230104-8cc8b8f2.pth
+ Config: configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_tiny_22k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 7214472320
+ Parameters: 15623800
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.36
+ Top 5 Accuracy: 96.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-f951ae87.pth
+ Config: configs/convnext_v2/convnext-v2-nano_32xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_nano_22k_384_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 13135236864
+ Parameters: 28635496
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.09
+ Top 5 Accuracy: 97.63
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-d8579f84.pth
+ Config: configs/convnext_v2/convnext-v2-tiny_32xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_tiny_22k_384_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-base_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 15382561792
+ Parameters: 88717800
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_3rdparty-fcmae_in1k_20230104-8a798eaf.pth
+ Config: configs/convnext_v2/convnext-v2-base_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_base_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-base_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 15382561792
+ Parameters: 88717800
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.87
+ Top 5 Accuracy: 97.08
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-pre_3rdparty_in1k_20230104-00a70fa4.pth
+ Config: configs/convnext_v2/convnext-v2-base_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_base_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 15382561792
+ Parameters: 88717800
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.74
+ Top 5 Accuracy: 98.02
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k_20230104-c48d16a5.pth
+ Config: configs/convnext_v2/convnext-v2-base_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_base_22k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-large_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 34403182080
+ Parameters: 197956840
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_3rdparty-fcmae_in1k_20230104-bf38df92.pth
+ Config: configs/convnext_v2/convnext-v2-large_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_large_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-large_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 34403182080
+ Parameters: 197956840
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.76
+ Top 5 Accuracy: 97.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-pre_3rdparty_in1k_20230104-ef393013.pth
+ Config: configs/convnext_v2/convnext-v2-large_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_large_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 34403182080
+ Parameters: 197956840
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.26
+ Top 5 Accuracy: 98.24
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k_20230104-d9c4dc0c.pth
+ Config: configs/convnext_v2/convnext-v2-large_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_large_22k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 45205885952
+ Parameters: 88717800
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.63
+ Top 5 Accuracy: 98.42
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-379425cc.pth
+ Config: configs/convnext_v2/convnext-v2-base_32xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_base_22k_384_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 101103214080
+ Parameters: 197956840
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 88.18
+ Top 5 Accuracy: 98.52
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-9139a1f3.pth
+ Config: configs/convnext_v2/convnext-v2-large_32xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_large_22k_384_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-huge_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 114998639360
+ Parameters: 660289640
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_3rdparty-fcmae_in1k_20230104-fe43ae6c.pth
+ Config: configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_huge_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-huge_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 114998639360
+ Parameters: 660289640
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.25
+ Top 5 Accuracy: 97.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-pre_3rdparty_in1k_20230104-f795e5b8.pth
+ Config: configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_huge_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 337955157760
+ Parameters: 660289640
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 88.68
+ Top 5 Accuracy: 98.73
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-02a4eb35.pth
+ Config: configs/convnext_v2/convnext-v2-huge_32xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_huge_22k_384_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-512px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 600809158400
+ Parameters: 660289640
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 88.86
+ Top 5 Accuracy: 98.74
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-512px_20230104-ce32e63c.pth
+ Config: configs/convnext_v2/convnext-v2-huge_32xb32_in1k-512px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_huge_22k_512_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
diff --git a/configs/cspnet/README.md b/configs/cspnet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f3b145ba0399b660d03233d9deb11913fbc3c438
--- /dev/null
+++ b/configs/cspnet/README.md
@@ -0,0 +1,78 @@
+# CSPNet
+
+> [CSPNet: A New Backbone that can Enhance Learning Capability of CNN](https://arxiv.org/abs/1911.11929)
+
+
+
+## Abstract
+
+Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from appreciating the advanced technology. In this paper, we propose Cross Stage Partial Network (CSPNet) to mitigate the problem that previous works require heavy inference computations from the network architecture perspective. We attribute the problem to the duplicate gradient information within network optimization. The proposed networks respect the variability of the gradients by integrating feature maps from the beginning and the end of a network stage, which, in our experiments, reduces computations by 20% with equivalent or even superior accuracy on the ImageNet dataset, and significantly outperforms state-of-the-art approaches in terms of AP50 on the MS COCO object detection dataset. The CSPNet is easy to implement and general enough to cope with architectures based on ResNet, ResNeXt, and DenseNet. Source code is at this https URL.
+
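+The cross-stage partial idea described above can be illustrated with a minimal
+sketch (an assumption-level illustration only, not MMPreTrain's `CSPDarkNet` /
+`CSPResNet` implementation, which uses convolutional split and transition
+layers): the stage input is split along the channel dimension, only one part is
+processed by the stage's blocks, and the two parts are concatenated at the end.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class CSPStageSketch(nn.Module):
+    """Conceptual cross-stage partial stage (illustrative only)."""
+
+    def __init__(self, channels: int, blocks: nn.Module):
+        super().__init__()
+        self.blocks = blocks        # e.g. a stack of residual blocks
+        self.split = channels // 2  # half of the channels bypass the blocks
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        part1 = x[:, :self.split]   # untouched branch
+        part2 = x[:, self.split:]   # branch processed by the stage blocks
+        return torch.cat([part1, self.blocks(part2)], dim=1)
+
+
+stage = CSPStageSketch(64, nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU()))
+print(stage(torch.rand(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
+```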
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('cspdarknet50_3rdparty_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('cspdarknet50_3rdparty_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/cspnet/cspdarknet50_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/cspnet/cspdarknet50_3rdparty_8xb32_in1k_20220329-bd275287.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :-----------------------------------------------------------------------------: |
+| `cspdarknet50_3rdparty_8xb32_in1k`\* | From scratch | 27.64 | 5.04 | 80.05 | 95.07 | [config](cspdarknet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/cspnet/cspdarknet50_3rdparty_8xb32_in1k_20220329-bd275287.pth) |
+| `cspresnet50_3rdparty_8xb32_in1k`\* | From scratch | 21.62 | 3.48 | 79.55 | 94.68 | [config](cspresnet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/cspnet/cspresnet50_3rdparty_8xb32_in1k_20220329-dd6dddfb.pth) |
+| `cspresnext50_3rdparty_8xb32_in1k`\* | From scratch | 20.57 | 3.11 | 79.96 | 94.96 | [config](cspresnext50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/cspnet/cspresnext50_3rdparty_8xb32_in1k_20220329-2cc84d21.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/rwightman/pytorch-image-models). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{wang2020cspnet,
+ title={CSPNet: A new backbone that can enhance learning capability of CNN},
+ author={Wang, Chien-Yao and Liao, Hong-Yuan Mark and Wu, Yueh-Hua and Chen, Ping-Yang and Hsieh, Jun-Wei and Yeh, I-Hau},
+ booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops},
+ pages={390--391},
+ year={2020}
+}
+```
diff --git a/configs/cspnet/cspdarknet50_8xb32_in1k.py b/configs/cspnet/cspdarknet50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..851148109e72202cd5eca721fb66023ab2934e90
--- /dev/null
+++ b/configs/cspnet/cspdarknet50_8xb32_in1k.py
@@ -0,0 +1,45 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='CSPDarkNet', depth=53),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=288,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/cspnet/cspresnet50_8xb32_in1k.py b/configs/cspnet/cspresnet50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d149637aabae7b8cdf691262796becc4cfcc5efc
--- /dev/null
+++ b/configs/cspnet/cspresnet50_8xb32_in1k.py
@@ -0,0 +1,45 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='CSPResNet', depth=50),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=288,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/cspnet/cspresnext50_8xb32_in1k.py b/configs/cspnet/cspresnext50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1f8c15c12f6ab42349eda2a3680f07eabb855448
--- /dev/null
+++ b/configs/cspnet/cspresnext50_8xb32_in1k.py
@@ -0,0 +1,45 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='CSPResNeXt', depth=50),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/cspnet/metafile.yml b/configs/cspnet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..31036325f6e9c96c574a303f60990e28fe7822b9
--- /dev/null
+++ b/configs/cspnet/metafile.yml
@@ -0,0 +1,64 @@
+Collections:
+ - Name: CSPNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Cross Stage Partial Stage
+ Paper:
+ URL: https://arxiv.org/abs/1911.11929
+ Title: 'CSPNet: A New Backbone that can Enhance Learning Capability of CNN'
+ README: configs/cspnet/README.md
+ Code:
+ Version: v0.22.0
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.22.0/mmcls/models/backbones/cspnet.py
+
+Models:
+ - Name: cspdarknet50_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 5040000000
+ Parameters: 27640000
+ In Collection: CSPNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.05
+ Top 5 Accuracy: 95.07
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/cspnet/cspdarknet50_3rdparty_8xb32_in1k_20220329-bd275287.pth
+ Config: configs/cspnet/cspdarknet50_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/cspdarknet53_ra_256-d05c7c21.pth
+ Code: https://github.com/rwightman/pytorch-image-models
+ - Name: cspresnet50_3rdparty_8xb32_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 3480000000
+ Parameters: 21620000
+ In Collection: CSPNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.55
+ Top 5 Accuracy: 94.68
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/cspnet/cspresnet50_3rdparty_8xb32_in1k_20220329-dd6dddfb.pth
+ Config: configs/cspnet/cspresnet50_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/cspresnet50_ra-d3e8d487.pth
+ Code: https://github.com/rwightman/pytorch-image-models
+ - Name: cspresnext50_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 3110000000
+ Parameters: 20570000
+ In Collection: CSPNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.96
+ Top 5 Accuracy: 94.96
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/cspnet/cspresnext50_3rdparty_8xb32_in1k_20220329-2cc84d21.pth
+ Config: configs/cspnet/cspresnext50_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/cspresnext50_ra_224-648b4713.pth
+ Code: https://github.com/rwightman/pytorch-image-models
diff --git a/configs/csra/README.md b/configs/csra/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..99b29571c9e602d501518c0fdfcd490cee83f183
--- /dev/null
+++ b/configs/csra/README.md
@@ -0,0 +1,73 @@
+# CSRA
+
+> [Residual Attention: A Simple but Effective Method for Multi-Label Recognition](https://arxiv.org/abs/2108.02456)
+
+
+
+## Abstract
+
+Multi-label image recognition is a challenging computer vision task of practical use. Progresses in this area, however, are often characterized by complicated methods, heavy computations, and lack of intuitive explanations. To effectively capture different spatial regions occupied by objects from different categories, we propose an embarrassingly simple module, named class-specific residual attention (CSRA). CSRA generates class-specific features for every category by proposing a simple spatial attention score, and then combines it with the class-agnostic average pooling feature. CSRA achieves state-of-the-art results on multilabel recognition, and at the same time is much simpler than them. Furthermore, with only 4 lines of code, CSRA also leads to consistent improvement across many diverse pretrained models and datasets without any extra training. CSRA is both easy to implement and light in computations, which also enjoys intuitive explanations and visualizations.
+
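+The class-specific residual attention described above can be sketched in a few
+lines (a hedged illustration of the idea, not MMPreTrain's `CSRAClsHead`; the
+`lam` and temperature `T` values below are assumptions for the illustration):
+
+```python
+import torch
+
+
+def csra_sketch(feat, fc_weight, lam=0.1, T=99.0):
+    # feat: (B, C, H, W) backbone feature map; fc_weight: (K, C) classifier weights.
+    score = torch.einsum('bchw,kc->bkhw', feat, fc_weight).flatten(2)  # (B, K, HW)
+    base = score.mean(dim=-1)                # class-agnostic average pooling
+    attn = torch.softmax(score * T, dim=-1)  # class-specific spatial attention
+    residual = (attn * score).sum(dim=-1)    # attention-weighted spatial pooling
+    return base + lam * residual             # residual combination per class
+
+
+logits = csra_sketch(torch.rand(2, 2048, 14, 14), torch.rand(20, 2048))
+print(logits.shape)  # torch.Size([2, 20])
+```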
+
+
+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('resnet101-csra_1xb16_voc07-448px', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/csra/resnet101-csra_1xb16_voc07-448px.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/csra/resnet101-csra_1xb16_voc07-448px.py https://download.openmmlab.com/mmclassification/v0/csra/resnet101-csra_1xb16_voc07-448px_20220722-29efb40a.pth
+```
+
+
+
+## Models and results
+
+### Multi-Label Classification on PASCAL VOC 2007
+
+| Model | Pretrain | Params (M) | Flops (G) | CF1 | OF1 | mAP | Config | Download |
+| :--------------------------------- | :----------: | :--------: | :-------: | :---: | :---: | :---: | :-------------------------------------------: | :-------------------------------------------------------------------------: |
+| `resnet101-csra_1xb16_voc07-448px` | From scratch | 23.55 | 4.12 | 89.16 | 90.80 | 94.98 | [config](resnet101-csra_1xb16_voc07-448px.py) | [model](https://download.openmmlab.com/mmclassification/v0/csra/resnet101-csra_1xb16_voc07-448px_20220722-29efb40a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/csra/resnet101-csra_1xb16_voc07-448px_20220722-29efb40a.json) |
+
+## Citation
+
+```bibtex
+@misc{https://doi.org/10.48550/arxiv.2108.02456,
+ doi = {10.48550/ARXIV.2108.02456},
+ url = {https://arxiv.org/abs/2108.02456},
+ author = {Zhu, Ke and Wu, Jianxin},
+ keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
+ title = {Residual Attention: A Simple but Effective Method for Multi-Label Recognition},
+ publisher = {arXiv},
+ year = {2021},
+ copyright = {arXiv.org perpetual, non-exclusive license}
+}
+```
diff --git a/configs/csra/metafile.yml b/configs/csra/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..112f50c9d44e1bc12359653f89920b93eae67361
--- /dev/null
+++ b/configs/csra/metafile.yml
@@ -0,0 +1,29 @@
+Collections:
+ - Name: CSRA
+ Metadata:
+ Training Data: PASCAL VOC 2007
+ Architecture:
+ - Class-specific Residual Attention
+ Paper:
+ URL: https://arxiv.org/abs/2108.02456
+ Title: 'Residual Attention: A Simple but Effective Method for Multi-Label Recognition'
+ README: configs/csra/README.md
+ Code:
+ Version: v0.24.0
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.24.0/mmcls/models/heads/multi_label_csra_head.py
+
+Models:
+ - Name: resnet101-csra_1xb16_voc07-448px
+ Metadata:
+ FLOPs: 4120000000
+ Parameters: 23550000
+ In Collection: CSRA
+ Results:
+ - Dataset: PASCAL VOC 2007
+ Metrics:
+ mAP: 94.98
+ OF1: 90.80
+ CF1: 89.16
+ Task: Multi-Label Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/csra/resnet101-csra_1xb16_voc07-448px_20220722-29efb40a.pth
+ Config: configs/csra/resnet101-csra_1xb16_voc07-448px.py
diff --git a/configs/csra/resnet101-csra_1xb16_voc07-448px.py b/configs/csra/resnet101-csra_1xb16_voc07-448px.py
new file mode 100644
index 0000000000000000000000000000000000000000..85135ae215c072accb4038b1a3fb4b3b796a6072
--- /dev/null
+++ b/configs/csra/resnet101-csra_1xb16_voc07-448px.py
@@ -0,0 +1,78 @@
+_base_ = ['../_base_/datasets/voc_bs16.py', '../_base_/default_runtime.py']
+
+# Pre-trained Checkpoint Path
+checkpoint = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_8xb32_in1k_20210831-539c63f8.pth' # noqa
+# If you want to use the pre-trained ResNet101-CutMix weight from the original
+# repo (https://github.com/Kevinz-code/CSRA), the script
+# 'tools/model_converters/torchvision_to_mmpretrain.py' can help you convert
+# the weight into MMPreTrain format. With that weight, the mAP reaches 95.5.
+# checkpoint = 'PATH/TO/PRE-TRAINED_WEIGHT'
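+# A hypothetical invocation of the converter script (the exact CLI arguments
+# are assumptions; check the script's --help for the real interface):
+#   python tools/model_converters/torchvision_to_mmpretrain.py \
+#       downloaded/resnet101_cutmix.pth converted/resnet101_cutmix_mmpretrain.pth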
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch',
+ init_cfg=dict(
+ type='Pretrained', checkpoint=checkpoint, prefix='backbone')),
+ neck=None,
+ head=dict(
+ type='CSRAClsHead',
+ num_classes=20,
+ in_channels=2048,
+ num_heads=1,
+ lam=0.1,
+ loss=dict(type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0)))
+
+# dataset setting
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0, 0, 0],
+ std=[255, 255, 255])
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=448, crop_ratio_range=(0.7, 1.0)),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=448),
+ dict(
+ type='PackInputs',
+ # `gt_label_difficult` is needed for VOC evaluation
+ meta_keys=('sample_idx', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction',
+ 'gt_label_difficult')),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# optimizer
+# the lr of classifier.head is 10 * base_lr, which helps convergence.
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.0002, momentum=0.9, weight_decay=0.0001),
+ paramwise_cfg=dict(custom_keys={'head': dict(lr_mult=10)}))
+
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-7,
+ by_epoch=True,
+ begin=0,
+ end=1,
+ convert_to_iter_based=True),
+ dict(type='StepLR', by_epoch=True, step_size=6, gamma=0.1)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=20, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/davit/README.md b/configs/davit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1be19d98e37d4bf75dcc3d89ce689d09512b0505
--- /dev/null
+++ b/configs/davit/README.md
@@ -0,0 +1,77 @@
+# DaViT
+
+> [DaViT: Dual Attention Vision Transformers](https://arxiv.org/abs/2204.03645v1)
+
+
+
+## Abstract
+
+In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Gaint reaches 90.4% top-1 accuracy on ImageNet-1K.
+
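+A minimal conceptual sketch of the two attention types described above (spatial
+tokens vs. channel tokens); it ignores DaViT's window/group partitioning and
+multi-head projections, so it is an illustration rather than the MMPreTrain
+implementation:
+
+```python
+import torch
+
+
+def spatial_self_attention(x):
+    # x: (B, N, C); tokens are spatial positions, features live along channels.
+    attn = torch.softmax(x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
+    return attn @ x  # (B, N, C)
+
+
+def channel_self_attention(x):
+    # Transpose so that channels act as tokens: each channel token summarizes
+    # the whole image, which makes this attention inherently global.
+    xt = x.transpose(-2, -1)  # (B, C, N)
+    attn = torch.softmax(xt @ xt.transpose(-2, -1) / xt.shape[-1] ** 0.5, dim=-1)
+    return (attn @ xt).transpose(-2, -1)  # back to (B, N, C)
+
+
+tokens = torch.rand(1, 196, 96)
+print(spatial_self_attention(tokens).shape, channel_self_attention(tokens).shape)
+```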
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('davit-tiny_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('davit-tiny_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/davit/davit-tiny_4xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/davit/davit-tiny_3rdparty_in1k_20221116-700fdf7d.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :------------------------------------------------------------------------------------: |
+| `davit-tiny_3rdparty_in1k`\* | From scratch | 28.36 | 4.54 | 82.24 | 96.13 | [config](davit-tiny_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/davit/davit-tiny_3rdparty_in1k_20221116-700fdf7d.pth) |
+| `davit-small_3rdparty_in1k`\* | From scratch | 49.75 | 8.80 | 83.61 | 96.75 | [config](davit-small_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/davit/davit-small_3rdparty_in1k_20221116-51a849a6.pth) |
+| `davit-base_3rdparty_in1k`\* | From scratch | 87.95 | 15.51 | 84.09 | 96.82 | [config](davit-base_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/davit/davit-base_3rdparty_in1k_20221116-19e0d956.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/dingmyu/davit/blob/main/mmdet/mmdet/models/backbones/davit.py#L355). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{ding2022davit,
+ title={DaViT: Dual Attention Vision Transformer},
+ author={Ding, Mingyu and Xiao, Bin and Codella, Noel and Luo, Ping and Wang, Jingdong and Yuan, Lu},
+ booktitle={ECCV},
+ year={2022},
+}
+```
diff --git a/configs/davit/davit-base_4xb256_in1k.py b/configs/davit/davit-base_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..071702fa7b69a3d893d9999ecf9ace28afbe193d
--- /dev/null
+++ b/configs/davit/davit-base_4xb256_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/davit/davit-base.py',
+ '../_base_/datasets/imagenet_bs256_davit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# data settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/davit/davit-small_4xb256_in1k.py b/configs/davit/davit-small_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e341031016c53b57adb477093f89b4524c6db4c1
--- /dev/null
+++ b/configs/davit/davit-small_4xb256_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/davit/davit-small.py',
+ '../_base_/datasets/imagenet_bs256_davit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# data settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/davit/davit-tiny_4xb256_in1k.py b/configs/davit/davit-tiny_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a16d87f4630b73fd4d76b52bbe926cb75dbb1d03
--- /dev/null
+++ b/configs/davit/davit-tiny_4xb256_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/davit/davit-tiny.py',
+ '../_base_/datasets/imagenet_bs256_davit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# data settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/davit/metafile.yml b/configs/davit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..588c18fd6dade71ff114a724a42a68a1a38b72bc
--- /dev/null
+++ b/configs/davit/metafile.yml
@@ -0,0 +1,71 @@
+Collections:
+ - Name: DaViT
+ Metadata:
+ Architecture:
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ Paper:
+ URL: https://arxiv.org/abs/2204.03645v1
+ Title: 'DaViT: Dual Attention Vision Transformers'
+ README: configs/davit/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc3/mmcls/models/backbones/davit.py
+ Version: v1.0.0rc3
+
+Models:
+ - Name: davit-tiny_3rdparty_in1k
+ In Collection: DaViT
+ Metadata:
+ FLOPs: 4539698688
+ Parameters: 28360168
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 82.24
+ Top 5 Accuracy: 96.13
+ Weights: https://download.openmmlab.com/mmclassification/v0/davit/davit-tiny_3rdparty_in1k_20221116-700fdf7d.pth
+ Converted From:
+ Weights: https://drive.google.com/file/d/1RSpi3lxKaloOL5-or20HuG975tbPwxRZ/view?usp=sharing
+ Code: https://github.com/dingmyu/davit/blob/main/mmdet/mmdet/models/backbones/davit.py#L355
+ Config: configs/davit/davit-tiny_4xb256_in1k.py
+ - Name: davit-small_3rdparty_in1k
+ In Collection: DaViT
+ Metadata:
+ FLOPs: 8799942144
+ Parameters: 49745896
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 83.61
+ Top 5 Accuracy: 96.75
+ Weights: https://download.openmmlab.com/mmclassification/v0/davit/davit-small_3rdparty_in1k_20221116-51a849a6.pth
+ Converted From:
+ Weights: https://drive.google.com/file/d/1q976ruj45mt0RhO9oxhOo6EP_cmj4ahQ/view?usp=sharing
+ Code: https://github.com/dingmyu/davit/blob/main/mmdet/mmdet/models/backbones/davit.py#L355
+ Config: configs/davit/davit-small_4xb256_in1k.py
+ - Name: davit-base_3rdparty_in1k
+ In Collection: DaViT
+ Metadata:
+ FLOPs: 15509702656
+ Parameters: 87954408
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.09
+ Top 5 Accuracy: 96.82
+ Weights: https://download.openmmlab.com/mmclassification/v0/davit/davit-base_3rdparty_in1k_20221116-19e0d956.pth
+ Converted From:
+ Weights: https://drive.google.com/file/d/1u9sDBEueB-YFuLigvcwf4b2YyA4MIVsZ/view?usp=sharing
+ Code: https://github.com/dingmyu/davit/blob/main/mmdet/mmdet/models/backbones/davit.py#L355
+ Config: configs/davit/davit-base_4xb256_in1k.py
diff --git a/configs/deit/README.md b/configs/deit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..ee434140a4316fed147c171ea425b6deff2aead6
--- /dev/null
+++ b/configs/deit/README.md
@@ -0,0 +1,97 @@
+# DeiT
+
+> [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877)
+
+
+
+## Abstract
+
+Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
+
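+The token-based distillation mentioned above can be summarized with a short
+sketch (a hedged illustration of the paper's hard-distillation objective, not a
+training recipe provided by these configs; the equal 0.5/0.5 weighting follows
+the hard-label variant):
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
+    # The class token is supervised by the ground-truth labels, while the
+    # distillation token is supervised by the teacher's hard predictions.
+    loss_cls = F.cross_entropy(cls_logits, labels)
+    loss_dist = F.cross_entropy(dist_logits, teacher_logits.argmax(dim=-1))
+    return 0.5 * loss_cls + 0.5 * loss_dist
+
+
+loss = hard_distillation_loss(
+    torch.randn(4, 1000), torch.randn(4, 1000),
+    torch.randint(0, 1000, (4,)), torch.randn(4, 1000))
+print(loss.item())
+```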
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('deit-tiny_4xb256_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('deit-tiny_4xb256_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/deit/deit-tiny_4xb256_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/deit/deit-tiny_4xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------------: | :--------------------------------------------------: |
+| `deit-tiny_4xb256_in1k` | From scratch | 5.72 | 1.26 | 74.50 | 92.24 | [config](deit-tiny_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.json) |
+| `deit-tiny-distilled_3rdparty_in1k`\* | From scratch | 5.91 | 1.27 | 74.51 | 91.90 | [config](deit-tiny-distilled_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny-distilled_3rdparty_pt-4xb256_in1k_20211216-c429839a.pth) |
+| `deit-small_4xb256_in1k` | From scratch | 22.05 | 4.61 | 80.69 | 95.06 | [config](deit-small_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-small_pt-4xb256_in1k_20220218-9425b9bb.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/deit/deit-small_pt-4xb256_in1k_20220218-9425b9bb.json) |
+| `deit-small-distilled_3rdparty_in1k`\* | From scratch | 22.44 | 4.63 | 81.17 | 95.40 | [config](deit-small-distilled_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-small-distilled_3rdparty_pt-4xb256_in1k_20211216-4de1d725.pth) |
+| `deit-base_16xb64_in1k` | From scratch | 86.57 | 17.58 | 81.76 | 95.81 | [config](deit-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base_pt-16xb64_in1k_20220216-db63c16c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/deit/deit-base_pt-16xb64_in1k_20220216-db63c16c.json) |
+| `deit-base_3rdparty_in1k`\* | From scratch | 86.57 | 17.58 | 81.79 | 95.59 | [config](deit-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base_3rdparty_pt-16xb64_in1k_20211124-6f40c188.pth) |
+| `deit-base-distilled_3rdparty_in1k`\* | From scratch | 87.34 | 17.67 | 83.33 | 96.49 | [config](deit-base-distilled_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base-distilled_3rdparty_pt-16xb64_in1k_20211216-42891296.pth) |
+| `deit-base_224px-pre_3rdparty_in1k-384px`\* | 224px | 86.86 | 55.54 | 83.04 | 96.31 | [config](deit-base_16xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base_3rdparty_ft-16xb32_in1k-384px_20211124-822d02f2.pth) |
+| `deit-base-distilled_224px-pre_3rdparty_in1k-384px`\* | 224px | 87.63 | 55.65 | 85.55 | 97.35 | [config](deit-base-distilled_16xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base-distilled_3rdparty_ft-16xb32_in1k-384px_20211216-e48d6000.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L168). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+```{warning}
+MMPreTrain doesn't support training the distilled versions of DeiT.
+The distilled checkpoints are provided for inference only.
+```
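+Distilled checkpoints can still be loaded for inference through the same
+high-level API; for example (a usage sketch assuming the model name listed in
+the table above):
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('deit-base-distilled_3rdparty_in1k', pretrained=True)
+out = model(torch.rand(1, 3, 224, 224))
+print(type(out))
+```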
+
+## Citation
+
+```bibtex
+@InProceedings{pmlr-v139-touvron21a,
+ title = {Training data-efficient image transformers & distillation through attention},
+ author = {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
+ booktitle = {International Conference on Machine Learning},
+ pages = {10347--10357},
+ year = {2021},
+ volume = {139},
+ month = {July}
+}
+```
diff --git a/configs/deit/deit-base-distilled_16xb32_in1k-384px.py b/configs/deit/deit-base-distilled_16xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..60d3112fd530917d2196a24c25d8d0223731c52d
--- /dev/null
+++ b/configs/deit/deit-base-distilled_16xb32_in1k-384px.py
@@ -0,0 +1,37 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DistilledVisionTransformer',
+ arch='deit-base',
+ img_size=384,
+ patch_size=16,
+ ),
+ neck=None,
+ head=dict(
+ type='DeiTClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ # Change to the path of the pretrained model
+ # init_cfg=dict(type='Pretrained', checkpoint=''),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=32)
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (16 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/deit/deit-base-distilled_16xb64_in1k.py b/configs/deit/deit-base-distilled_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..207bf250f62f3317df6535cf9b7e8dd0b4a1f5ac
--- /dev/null
+++ b/configs/deit/deit-base-distilled_16xb64_in1k.py
@@ -0,0 +1,46 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DistilledVisionTransformer',
+ arch='deit-base',
+ img_size=224,
+ patch_size=16),
+ neck=None,
+ head=dict(
+ type='DeiTClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
diff --git a/configs/deit/deit-base_16xb32_in1k-384px.py b/configs/deit/deit-base_16xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..762b4604348d1e8f0940f0243c9c824215d4b207
--- /dev/null
+++ b/configs/deit/deit-base_16xb32_in1k-384px.py
@@ -0,0 +1,37 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='deit-base',
+ img_size=384,
+ patch_size=16,
+ ),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ # Change to the path of the pretrained model
+ # init_cfg=dict(type='Pretrained', checkpoint=''),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=32)
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (16 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/deit/deit-base_16xb64_in1k.py b/configs/deit/deit-base_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..66f03a99f20a10649a954c15b2aa9c44374704fe
--- /dev/null
+++ b/configs/deit/deit-base_16xb64_in1k.py
@@ -0,0 +1,50 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='deit-base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime settings
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/deit/deit-small-distilled_4xb256_in1k.py b/configs/deit/deit-small-distilled_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9c7c58cb3d76e8b36f766080e4ec7de056a0621b
--- /dev/null
+++ b/configs/deit/deit-small-distilled_4xb256_in1k.py
@@ -0,0 +1,46 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DistilledVisionTransformer',
+ arch='deit-small',
+ img_size=224,
+ patch_size=16),
+ neck=None,
+ head=dict(
+ type='DeiTClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# data settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
diff --git a/configs/deit/deit-small_4xb256_in1k.py b/configs/deit/deit-small_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b96d84ec46bf2badd08b69fddaa2d8b8109b1ebf
--- /dev/null
+++ b/configs/deit/deit-small_4xb256_in1k.py
@@ -0,0 +1,48 @@
+# Compared with the original config, the small and tiny archs remove drop path
+# and the EMA hook.
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='deit-small',
+ img_size=224,
+ patch_size=16),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# data settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
diff --git a/configs/deit/deit-tiny-distilled_4xb256_in1k.py b/configs/deit/deit-tiny-distilled_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..00a9c4bd214a7c3d3eb1163b73aeb70251ce1bbc
--- /dev/null
+++ b/configs/deit/deit-tiny-distilled_4xb256_in1k.py
@@ -0,0 +1,47 @@
+# The distillation config is only for evaluation.
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DistilledVisionTransformer',
+ arch='deit-tiny',
+ img_size=224,
+ patch_size=16),
+ neck=None,
+ head=dict(
+ type='DeiTClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# data settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
diff --git a/configs/deit/deit-tiny_4xb256_in1k.py b/configs/deit/deit-tiny_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..486669e9c16e01ccc3d469c55bb04e714225b624
--- /dev/null
+++ b/configs/deit/deit-tiny_4xb256_in1k.py
@@ -0,0 +1,48 @@
+# Compared with the original config, the small and tiny archs remove drop path
+# and the EMA hook.
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='deit-tiny',
+ img_size=224,
+ patch_size=16),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# data settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
diff --git a/configs/deit/metafile.yml b/configs/deit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..f6f0c5e56a4f72fc7df812705b9d2ec4a6a589bb
--- /dev/null
+++ b/configs/deit/metafile.yml
@@ -0,0 +1,153 @@
+Collections:
+ - Name: DeiT
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Layer Normalization
+ - Scaled Dot-Product Attention
+ - Attention Dropout
+ - Multi-Head Attention
+ Paper:
+ Title: Training data-efficient image transformers & distillation through attention
+ URL: https://arxiv.org/abs/2012.12877
+ README: configs/deit/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.19.0/mmcls/models/backbones/deit.py
+ Version: v0.19.0
+
+Models:
+ - Name: deit-tiny_4xb256_in1k
+ Metadata:
+ FLOPs: 1258219200
+ Parameters: 5717416
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.5
+ Top 5 Accuracy: 92.24
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.pth
+ Config: configs/deit/deit-tiny_4xb256_in1k.py
+ - Name: deit-tiny-distilled_3rdparty_in1k
+ Metadata:
+ FLOPs: 1265371776
+ Parameters: 5910800
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.51
+ Top 5 Accuracy: 91.9
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny-distilled_3rdparty_pt-4xb256_in1k_20211216-c429839a.pth
+ Config: configs/deit/deit-tiny-distilled_4xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_tiny_distilled_patch16_224-b40b3cf7.pth
+ Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L108
+ - Name: deit-small_4xb256_in1k
+ Metadata:
+ FLOPs: 4607954304
+ Parameters: 22050664
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.69
+ Top 5 Accuracy: 95.06
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-small_pt-4xb256_in1k_20220218-9425b9bb.pth
+ Config: configs/deit/deit-small_4xb256_in1k.py
+ - Name: deit-small-distilled_3rdparty_in1k
+ Metadata:
+ FLOPs: 4632876288
+ Parameters: 22436432
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.17
+ Top 5 Accuracy: 95.4
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-small-distilled_3rdparty_pt-4xb256_in1k_20211216-4de1d725.pth
+ Config: configs/deit/deit-small-distilled_4xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_small_distilled_patch16_224-649709d9.pth
+ Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L123
+ - Name: deit-base_16xb64_in1k
+ Metadata:
+ FLOPs: 17581972224
+ Parameters: 86567656
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.76
+ Top 5 Accuracy: 95.81
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base_pt-16xb64_in1k_20220216-db63c16c.pth
+ Config: configs/deit/deit-base_16xb64_in1k.py
+ - Name: deit-base_3rdparty_in1k
+ Metadata:
+ FLOPs: 17581972224
+ Parameters: 86567656
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.79
+ Top 5 Accuracy: 95.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base_3rdparty_pt-16xb64_in1k_20211124-6f40c188.pth
+ Config: configs/deit/deit-base_16xb64_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth
+ Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L93
+ - Name: deit-base-distilled_3rdparty_in1k
+ Metadata:
+ FLOPs: 17674283520
+ Parameters: 87338192
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.33
+ Top 5 Accuracy: 96.49
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base-distilled_3rdparty_pt-16xb64_in1k_20211216-42891296.pth
+ Config: configs/deit/deit-base-distilled_16xb64_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_base_distilled_patch16_224-df68dfff.pth
+ Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L138
+ - Name: deit-base_224px-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 55538974464
+ Parameters: 86859496
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.04
+ Top 5 Accuracy: 96.31
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base_3rdparty_ft-16xb32_in1k-384px_20211124-822d02f2.pth
+ Config: configs/deit/deit-base_16xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_base_patch16_384-8de9b5d1.pth
+ Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L153
+ - Name: deit-base-distilled_224px-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 55645294080
+ Parameters: 87630032
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.55
+ Top 5 Accuracy: 97.35
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base-distilled_3rdparty_ft-16xb32_in1k-384px_20211216-e48d6000.pth
+ Config: configs/deit/deit-base-distilled_16xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_base_distilled_patch16_384-d0272ac0.pth
+ Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L168
diff --git a/configs/deit3/README.md b/configs/deit3/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..18694b7eb9b97589aece3c9bfc7187b9c9d83841
--- /dev/null
+++ b/configs/deit3/README.md
@@ -0,0 +1,90 @@
+# DeiT III: Revenge of the ViT
+
+> [DeiT III: Revenge of the ViT](https://arxiv.org/abs/2204.07118)
+
+
+
+## Abstract
+
+A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors either about the input data or of specific tasks. Recent works show that ViTs benefit from self-supervised pre-training, in particular BerT-like pre-training like BeiT. In this paper, we revisit the supervised training of ViTs. Our procedure builds upon and simplifies a recipe introduced for training ResNet-50. It includes a new simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning. Our evaluations on Image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning and semantic segmentation show that our procedure outperforms by a large margin previous fully supervised training recipes for ViT. It also reveals that the performance of our ViT trained with supervision is comparable to that of more recent architectures. Our results could serve as better baselines for recent self-supervised approaches demonstrated on ViT.
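+
+Below is a minimal, hypothetical sketch of the "3-Augment" recipe mentioned in the abstract, written with `torchvision` transforms for illustration only; the actual MMPreTrain pipeline is defined in the `imagenet_bs64_deit3_*.py` base configs, and the crop size, magnitudes and probabilities used here are assumptions.
+
+```python
+import torchvision.transforms as T
+
+# Pick exactly one of {grayscale, solarization, blur} per image, plus simple
+# color jitter and horizontal flip, as described in the DeiT III abstract.
+three_augment = T.Compose([
+    T.RandomResizedCrop(224),
+    T.RandomHorizontalFlip(),
+    T.RandomChoice([
+        T.Grayscale(num_output_channels=3),
+        T.RandomSolarize(threshold=128, p=1.0),
+        T.GaussianBlur(kernel_size=9),
+    ]),
+    T.ColorJitter(0.3, 0.3, 0.3),
+    T.ToTensor(),
+])
+```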
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('deit3-small-p16_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('deit3-small-p16_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/deit3/deit3-small-p16_64xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k_20221008-0f7c70cf.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------------: | :------------------------------------------------------: |
+| `deit3-small-p16_3rdparty_in1k`\* | From scratch | 22.06 | 4.61 | 81.35 | 95.31 | [config](deit3-small-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k_20221008-0f7c70cf.pth) |
+| `deit3-small-p16_3rdparty_in1k-384px`\* | From scratch | 22.21 | 15.52 | 83.43 | 96.68 | [config](deit3-small-p16_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k-384px_20221008-a2c1a0c7.pth) |
+| `deit3-small-p16_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 22.06 | 4.61 | 83.06 | 96.77 | [config](deit3-small-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_in21k-pre_3rdparty_in1k_20221009-dcd90827.pth) |
+| `deit3-small-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 22.21 | 15.52 | 84.84 | 97.48 | [config](deit3-small-p16_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_in21k-pre_3rdparty_in1k-384px_20221009-de116dd7.pth) |
+| `deit3-medium-p16_3rdparty_in1k`\* | From scratch | 38.85 | 8.00 | 82.99 | 96.22 | [config](deit3-medium-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-medium-p16_3rdparty_in1k_20221008-3b21284d.pth) |
+| `deit3-medium-p16_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 38.85 | 8.00 | 84.56 | 97.19 | [config](deit3-medium-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-medium-p16_in21k-pre_3rdparty_in1k_20221009-472f11e2.pth) |
+| `deit3-base-p16_3rdparty_in1k`\* | From scratch | 86.59 | 17.58 | 83.80 | 96.55 | [config](deit3-base-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_3rdparty_in1k_20221008-60b8c8bf.pth) |
+| `deit3-base-p16_3rdparty_in1k-384px`\* | From scratch | 86.88 | 55.54 | 85.08 | 97.25 | [config](deit3-base-p16_64xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_3rdparty_in1k-384px_20221009-e19e36d4.pth) |
+| `deit3-base-p16_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 86.59 | 17.58 | 85.70 | 97.75 | [config](deit3-base-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_in21k-pre_3rdparty_in1k_20221009-87983ca1.pth) |
+| `deit3-base-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 86.88 | 55.54 | 86.73 | 98.11 | [config](deit3-base-p16_64xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_in21k-pre_3rdparty_in1k-384px_20221009-5e4e37b9.pth) |
+| `deit3-large-p16_3rdparty_in1k`\* | From scratch | 304.37 | 61.60 | 84.87 | 97.01 | [config](deit3-large-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_3rdparty_in1k_20221009-03b427ea.pth) |
+| `deit3-large-p16_3rdparty_in1k-384px`\* | From scratch | 304.76 | 191.21 | 85.82 | 97.60 | [config](deit3-large-p16_64xb16_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_3rdparty_in1k-384px_20221009-4317ce62.pth) |
+| `deit3-large-p16_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 304.37 | 61.60 | 86.97 | 98.24 | [config](deit3-large-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_in21k-pre_3rdparty_in1k_20221009-d8d27084.pth) |
+| `deit3-large-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 304.76 | 191.21 | 87.73 | 98.51 | [config](deit3-large-p16_64xb16_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_in21k-pre_3rdparty_in1k-384px_20221009-75fea03f.pth) |
+| `deit3-huge-p14_3rdparty_in1k`\* | From scratch | 632.13 | 167.40 | 85.21 | 97.36 | [config](deit3-huge-p14_64xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-huge-p14_3rdparty_in1k_20221009-e107bcb7.pth) |
+| `deit3-huge-p14_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 632.13 | 167.40 | 87.19 | 98.26 | [config](deit3-huge-p14_64xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-huge-p14_in21k-pre_3rdparty_in1k_20221009-19b8a535.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{Touvron2022DeiTIR,
+ title={DeiT III: Revenge of the ViT},
+ author={Hugo Touvron and Matthieu Cord and Herve Jegou},
+ journal={arXiv preprint arXiv:2204.07118},
+ year={2022},
+}
+```
diff --git a/configs/deit3/deit3-base-p16_64xb32_in1k-384px.py b/configs/deit3/deit3-base-p16_64xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..b6c8a8c411ee96a88bc44c042cdf134a36eb05da
--- /dev/null
+++ b/configs/deit3/deit3-base-p16_64xb32_in1k-384px.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-base-p16-384.py',
+ '../_base_/datasets/imagenet_bs64_deit3_384.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/deit3/deit3-base-p16_64xb64_in1k.py b/configs/deit3/deit3-base-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c69a64cdd06da1e868bb08e9eec5cbf9b82f5aa9
--- /dev/null
+++ b/configs/deit3/deit3-base-p16_64xb64_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-base-p16-224.py',
+ '../_base_/datasets/imagenet_bs64_deit3_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/deit3/deit3-huge-p14_64xb32_in1k.py b/configs/deit3/deit3-huge-p14_64xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f8cae075b6a28f8519390983621b2dc98173e507
--- /dev/null
+++ b/configs/deit3/deit3-huge-p14_64xb32_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-huge-p14-224.py',
+ '../_base_/datasets/imagenet_bs64_deit3_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/deit3/deit3-large-p16_64xb16_in1k-384px.py b/configs/deit3/deit3-large-p16_64xb16_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..84fb0feae636a3f3c4b2297ed6935e817701cbea
--- /dev/null
+++ b/configs/deit3/deit3-large-p16_64xb16_in1k-384px.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-large-p16-384.py',
+ '../_base_/datasets/imagenet_bs64_deit3_384.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=16)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (16 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/deit3/deit3-large-p16_64xb64_in1k.py b/configs/deit3/deit3-large-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a67ac21f9ba3fefdb7e22429e565fb6ee6eeff86
--- /dev/null
+++ b/configs/deit3/deit3-large-p16_64xb64_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-large-p16-224.py',
+ '../_base_/datasets/imagenet_bs64_deit3_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/deit3/deit3-medium-p16_64xb64_in1k.py b/configs/deit3/deit3-medium-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..def48e682a5fa66e166f4419b8e1850e26f75d17
--- /dev/null
+++ b/configs/deit3/deit3-medium-p16_64xb64_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-medium-p16-224.py',
+ '../_base_/datasets/imagenet_bs64_deit3_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/deit3/deit3-small-p16_64xb64_in1k-384px.py b/configs/deit3/deit3-small-p16_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..e6b3e892c34268d2bdfeb9f7ab7f1808ea203558
--- /dev/null
+++ b/configs/deit3/deit3-small-p16_64xb64_in1k-384px.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-small-p16-384.py',
+ '../_base_/datasets/imagenet_bs64_deit3_384.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/deit3/deit3-small-p16_64xb64_in1k.py b/configs/deit3/deit3-small-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..58b0a2f1837e09edc3c43d6776fda169e4b0480b
--- /dev/null
+++ b/configs/deit3/deit3-small-p16_64xb64_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-small-p16-224.py',
+ '../_base_/datasets/imagenet_bs64_deit3_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/deit3/metafile.yml b/configs/deit3/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..6f50fdc396c017fcbf3d2542f6fe52c0ed5bf546
--- /dev/null
+++ b/configs/deit3/metafile.yml
@@ -0,0 +1,310 @@
+Collections:
+ - Name: DeiT3
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ URL: https://arxiv.org/abs/2204.07118
+ Title: 'DeiT III: Revenge of the ViT'
+ README: configs/deit3/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc2/mmcls/models/backbones/deit3.py
+ Version: v1.0.0rc2
+
+Models:
+ - Name: deit3-small-p16_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 4607954304
+ Parameters: 22059496
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 81.35
+ Top 5 Accuracy: 95.31
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k_20221008-0f7c70cf.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_small_224_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-small-p16_64xb64_in1k.py
+ - Name: deit3-small-p16_3rdparty_in1k-384px
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 15517663104
+ Parameters: 22205416
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 83.43
+ Top 5 Accuracy: 96.68
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k-384px_20221008-a2c1a0c7.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_small_384_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-small-p16_64xb64_in1k-384px.py
+ - Name: deit3-small-p16_in21k-pre_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 4607954304
+ Parameters: 22059496
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 83.06
+ Top 5 Accuracy: 96.77
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_in21k-pre_3rdparty_in1k_20221009-dcd90827.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_small_224_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-small-p16_64xb64_in1k.py
+ - Name: deit3-small-p16_in21k-pre_3rdparty_in1k-384px
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 15517663104
+ Parameters: 22205416
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.84
+ Top 5 Accuracy: 97.48
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_in21k-pre_3rdparty_in1k-384px_20221009-de116dd7.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_small_384_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-small-p16_64xb64_in1k-384px.py
+ - Name: deit3-medium-p16_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 8003064320
+ Parameters: 38849512
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 82.99
+ Top 5 Accuracy: 96.22
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-medium-p16_3rdparty_in1k_20221008-3b21284d.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_medium_224_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-medium-p16_64xb64_in1k.py
+ - Name: deit3-medium-p16_in21k-pre_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 8003064320
+ Parameters: 38849512
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.56
+ Top 5 Accuracy: 97.19
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-medium-p16_in21k-pre_3rdparty_in1k_20221009-472f11e2.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_medium_224_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-medium-p16_64xb64_in1k.py
+ - Name: deit3-base-p16_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 17581972224
+ Parameters: 86585320
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 83.80
+ Top 5 Accuracy: 96.55
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_3rdparty_in1k_20221008-60b8c8bf.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_base_224_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-base-p16_64xb64_in1k.py
+ - Name: deit3-base-p16_3rdparty_in1k-384px
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 55538974464
+ Parameters: 86877160
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.08
+ Top 5 Accuracy: 97.25
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_3rdparty_in1k-384px_20221009-e19e36d4.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_base_384_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-base-p16_64xb32_in1k-384px.py
+ - Name: deit3-base-p16_in21k-pre_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 17581972224
+ Parameters: 86585320
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.70
+ Top 5 Accuracy: 97.75
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_in21k-pre_3rdparty_in1k_20221009-87983ca1.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_base_224_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-base-p16_64xb64_in1k.py
+ - Name: deit3-base-p16_in21k-pre_3rdparty_in1k-384px
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 55538974464
+ Parameters: 86877160
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 86.73
+ Top 5 Accuracy: 98.11
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_in21k-pre_3rdparty_in1k-384px_20221009-5e4e37b9.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_base_384_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-base-p16_64xb32_in1k-384px.py
+ - Name: deit3-large-p16_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 61603111936
+ Parameters: 304374760
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.87
+ Top 5 Accuracy: 97.01
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_3rdparty_in1k_20221009-03b427ea.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_large_224_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-large-p16_64xb64_in1k.py
+ - Name: deit3-large-p16_3rdparty_in1k-384px
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 191210034176
+ Parameters: 304763880
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.82
+ Top 5 Accuracy: 97.60
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_3rdparty_in1k-384px_20221009-4317ce62.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_large_384_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-large-p16_64xb16_in1k-384px.py
+ - Name: deit3-large-p16_in21k-pre_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 61603111936
+ Parameters: 304374760
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 86.97
+ Top 5 Accuracy: 98.24
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_in21k-pre_3rdparty_in1k_20221009-d8d27084.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_large_224_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-large-p16_64xb64_in1k.py
+ - Name: deit3-large-p16_in21k-pre_3rdparty_in1k-384px
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 191210034176
+ Parameters: 304763880
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 87.73
+ Top 5 Accuracy: 98.51
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_in21k-pre_3rdparty_in1k-384px_20221009-75fea03f.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_large_384_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-large-p16_64xb16_in1k-384px.py
+ - Name: deit3-huge-p14_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 167400741120
+ Parameters: 632126440
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.21
+ Top 5 Accuracy: 97.36
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-huge-p14_3rdparty_in1k_20221009-e107bcb7.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_huge_224_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-huge-p14_64xb32_in1k.py
+ - Name: deit3-huge-p14_in21k-pre_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 167400741120
+ Parameters: 632126440
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 87.19
+ Top 5 Accuracy: 98.26
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-huge-p14_in21k-pre_3rdparty_in1k_20221009-19b8a535.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_huge_224_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-huge-p14_64xb32_in1k.py
diff --git a/configs/densecl/README.md b/configs/densecl/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..d1e1295d9f6a12d47196e6d2c4663d0758076167
--- /dev/null
+++ b/configs/densecl/README.md
@@ -0,0 +1,85 @@
+# DenseCL
+
+> [Dense contrastive learning for self-supervised visual pre-training](https://arxiv.org/abs/2011.09157)
+
+
+
+## Abstract
+
+To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
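+
+As a rough illustration of the dense (pixel-level) contrastive objective described above, here is a minimal sketch of an InfoNCE-style loss applied per spatial location, with each query-view feature matched to its most similar key-view feature. This is a simplification under stated assumptions, not the actual DenseCL implementation in MMPreTrain.
+
+```python
+import torch
+import torch.nn.functional as F
+
+def dense_info_nce(q, k, queue, temperature=0.2):
+    """q, k: (B, C, HW) dense features of two views; queue: (C, K) negatives."""
+    q = F.normalize(q, dim=1)
+    k = F.normalize(k, dim=1)
+    # Match every location in the query view to its most similar key location.
+    sim = torch.einsum('bci,bcj->bij', q, k)                  # (B, HW, HW)
+    idx = sim.argmax(dim=2)                                   # (B, HW)
+    k_matched = torch.gather(
+        k, 2, idx.unsqueeze(1).expand(-1, k.size(1), -1))     # (B, C, HW)
+    pos = (q * k_matched).sum(dim=1, keepdim=True)            # (B, 1, HW)
+    neg = torch.einsum('bci,ck->bki', q, queue)               # (B, K, HW)
+    logits = torch.cat([pos, neg], dim=1) / temperature
+    # The positive pair sits at index 0 for every spatial location.
+    labels = logits.new_zeros(logits.size(0), logits.size(2), dtype=torch.long)
+    return F.cross_entropy(logits, labels)
+```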
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('densecl_resnet50_8xb32-coslr-200e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------------- | :--------: | :-------: | :-------------------------------------------------: | :----------------------------------------------------------------------------------------: |
+| `densecl_resnet50_8xb32-coslr-200e_in1k` | 64.85 | 4.11 | [config](densecl_resnet50_8xb32-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k` | [DENSECL](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth) | 25.56 | 4.11 | 63.50 | [config](benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{wang2021dense,
+ title={Dense contrastive learning for self-supervised visual pre-training},
+ author={Wang, Xinlong and Zhang, Rufeng and Shen, Chunhua and Kong, Tao and Li, Lei},
+ booktitle={CVPR},
+ year={2021}
+}
+```
diff --git a/configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py b/configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..37795d9c866c5f9b26b0e016959a01677b8a216e
--- /dev/null
+++ b/configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_sgd_steplr_100e.py',
+ '../../_base_/default_runtime.py',
+]
+
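+# Linear-evaluation protocol: `frozen_stages=4` below freezes the entire
+# ResNet-50 backbone, so only the linear classification head is trained on
+# top of the pre-trained DenseCL features. The empty `checkpoint=''` is a
+# placeholder to be replaced with the path/URL of a pre-trained checkpoint.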
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=30., momentum=0.9, weight_decay=0.))
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py b/configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8a3959f1a91c1911e426563759795afeef71bea0
--- /dev/null
+++ b/configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_mocov2.py',
+ '../_base_/schedules/imagenet_sgd_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
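+# Key hyper-parameters below: `queue_len` is the size of the MoCo-style
+# negative queue, `momentum` controls the momentum (EMA) update of the key
+# encoder, and `loss_lambda` balances the global and dense contrastive loss
+# terms.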
+model = dict(
+ type='DenseCL',
+ queue_len=65536,
+ feat_dim=128,
+ momentum=0.001,
+ loss_lambda=0.5,
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='BN'),
+ zero_init_residual=False),
+ neck=dict(
+ type='DenseCLNeck',
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=128,
+ num_grid=None),
+ head=dict(
+ type='ContrastiveHead',
+ loss=dict(type='CrossEntropyLoss'),
+ temperature=0.2),
+)
+find_unused_parameters = True
+
+# runtime settings
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/densecl/metafile.yml b/configs/densecl/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..24449910aaa5930cbd32ec8fae18dec62ee73505
--- /dev/null
+++ b/configs/densecl/metafile.yml
@@ -0,0 +1,44 @@
+Collections:
+ - Name: DenseCL
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Architecture:
+ - ResNet
+ Paper:
+ Title: Dense contrastive learning for self-supervised visual pre-training
+ URL: https://arxiv.org/abs/2011.09157
+ README: configs/densecl/README.md
+
+Models:
+ - Name: densecl_resnet50_8xb32-coslr-200e_in1k
+ Metadata:
+ Epochs: 200
+ Batch Size: 256
+ FLOPs: 4109364224
+ Parameters: 64850560
+ Training Data: ImageNet-1k
+ In Collection: DenseCL
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth
+ Config: configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
+ Downstream:
+ - resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k
+ - Name: resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 256
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: DenseCL
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 63.5
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth
+ Config: configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
diff --git a/configs/densenet/README.md b/configs/densenet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..fe40fdd99cf069d76b4937e96ae252c5122ba953
--- /dev/null
+++ b/configs/densenet/README.md
@@ -0,0 +1,82 @@
+# DenseNet
+
+> [Densely Connected Convolutional Networks](https://arxiv.org/abs/1608.06993)
+
+
+
+## Abstract
+
+Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance.
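+
+To make the connectivity pattern above concrete, here is a minimal, hypothetical dense-block sketch in PyTorch using plain BN-ReLU-Conv layers; the real DenseNet additionally uses 1x1 bottleneck convolutions and transition layers between blocks.
+
+```python
+import torch
+import torch.nn as nn
+
+class TinyDenseBlock(nn.Module):
+    """Each layer receives the concatenation of all preceding feature maps."""
+
+    def __init__(self, in_channels, growth_rate, num_layers):
+        super().__init__()
+        self.layers = nn.ModuleList()
+        for i in range(num_layers):
+            channels = in_channels + i * growth_rate
+            self.layers.append(nn.Sequential(
+                nn.BatchNorm2d(channels),
+                nn.ReLU(inplace=True),
+                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
+            ))
+
+    def forward(self, x):
+        features = [x]
+        for layer in self.layers:
+            # L layers give L(L+1)/2 direct connections in total.
+            features.append(layer(torch.cat(features, dim=1)))
+        return torch.cat(features, dim=1)
+```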
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('densenet121_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('densenet121_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/densenet/densenet121_4xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :------------------------------------------------------------------------------------: |
+| `densenet121_3rdparty_in1k`\* | From scratch | 7.98 | 2.88 | 74.96 | 92.21 | [config](densenet121_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth) |
+| `densenet169_3rdparty_in1k`\* | From scratch | 14.15 | 3.42 | 76.08 | 93.11 | [config](densenet169_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet169_4xb256_in1k_20220426-a2889902.pth) |
+| `densenet201_3rdparty_in1k`\* | From scratch | 20.01 | 4.37 | 77.32 | 93.64 | [config](densenet201_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet201_4xb256_in1k_20220426-05cae4ef.pth) |
+| `densenet161_3rdparty_in1k`\* | From scratch | 28.68 | 7.82 | 77.61 | 93.83 | [config](densenet161_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet161_4xb256_in1k_20220426-ee6a80a9.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@misc{https://doi.org/10.48550/arxiv.1608.06993,
+ doi = {10.48550/ARXIV.1608.06993},
+ url = {https://arxiv.org/abs/1608.06993},
+ author = {Huang, Gao and Liu, Zhuang and van der Maaten, Laurens and Weinberger, Kilian Q.},
+ keywords = {Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
+ title = {Densely Connected Convolutional Networks},
+ publisher = {arXiv},
+ year = {2016},
+ copyright = {arXiv.org perpetual, non-exclusive license}
+}
+```
diff --git a/configs/densenet/densenet121_4xb256_in1k.py b/configs/densenet/densenet121_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..dc9854f5b44da27bcf4a5a4d5faefca625dc85b0
--- /dev/null
+++ b/configs/densenet/densenet121_4xb256_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/densenet/densenet121.py',
+ '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (256 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/densenet/densenet161_4xb256_in1k.py b/configs/densenet/densenet161_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a28a278bfc8132f4099afc576c43b05fd4095fd0
--- /dev/null
+++ b/configs/densenet/densenet161_4xb256_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/densenet/densenet161.py',
+ '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (256 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/densenet/densenet169_4xb256_in1k.py b/configs/densenet/densenet169_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..73469da115d23da250d790d68a36f55fb8eccfff
--- /dev/null
+++ b/configs/densenet/densenet169_4xb256_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/densenet/densenet169.py',
+ '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (256 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/densenet/densenet201_4xb256_in1k.py b/configs/densenet/densenet201_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4a9b7b1923351fc1f47ad1aa0e4470316e076e96
--- /dev/null
+++ b/configs/densenet/densenet201_4xb256_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/densenet/densenet201.py',
+ '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (256 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/densenet/metafile.yml b/configs/densenet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..40575acb6b4314d8ebc5c9317e9e032e0a8b0cba
--- /dev/null
+++ b/configs/densenet/metafile.yml
@@ -0,0 +1,76 @@
+Collections:
+ - Name: DenseNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - DenseBlock
+ Paper:
+ URL: https://arxiv.org/abs/1608.06993
+ Title: Densely Connected Convolutional Networks
+ README: configs/densenet/README.md
+
+Models:
+ - Name: densenet121_3rdparty_in1k
+ Metadata:
+ FLOPs: 2881695488
+ Parameters: 7978856
+ In Collection: DenseNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.96
+ Top 5 Accuracy: 92.21
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth
+ Config: configs/densenet/densenet121_4xb256_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/densenet121-a639ec97.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
+ - Name: densenet169_3rdparty_in1k
+ Metadata:
+ FLOPs: 3416860160
+ Parameters: 14149480
+ In Collection: DenseNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.08
+ Top 5 Accuracy: 93.11
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet169_4xb256_in1k_20220426-a2889902.pth
+ Config: configs/densenet/densenet169_4xb256_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/densenet169-b2777c0a.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
+ - Name: densenet201_3rdparty_in1k
+ Metadata:
+ FLOPs: 4365236736
+ Parameters: 20013928
+ In Collection: DenseNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.32
+ Top 5 Accuracy: 93.64
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet201_4xb256_in1k_20220426-05cae4ef.pth
+ Config: configs/densenet/densenet201_4xb256_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/densenet201-c1103571.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
+ - Name: densenet161_3rdparty_in1k
+ Metadata:
+ FLOPs: 7816363968
+ Parameters: 28681000
+ In Collection: DenseNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.61
+ Top 5 Accuracy: 93.83
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet161_4xb256_in1k_20220426-ee6a80a9.pth
+ Config: configs/densenet/densenet161_4xb256_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/densenet161-8d451a50.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
diff --git a/configs/dinov2/README.md b/configs/dinov2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..aa79d6b43c677f96236a52630b39ca9a6e381e5d
--- /dev/null
+++ b/configs/dinov2/README.md
@@ -0,0 +1,58 @@
+# DINOv2
+
+> [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)
+
+
+
+## Abstract
+
+The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
+
+
+
+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vit-small-p14_dinov2-pre_3rdparty', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------------ | :--------: | :-------: | :--------------------------------------------: | :------------------------------------------------------------------------------------------------: |
+| `vit-small-p14_dinov2-pre_3rdparty`\* | 22.06 | 46.76 | [config](vit-small-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth) |
+| `vit-base-p14_dinov2-pre_3rdparty`\* | 86.58 | 152.00 | [config](vit-base-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth) |
+| `vit-large-p14_dinov2-pre_3rdparty`\* | 304.00 | 507.00 | [config](vit-large-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth) |
+| `vit-giant-p14_dinov2-pre_3rdparty`\* | 1136.00 | 1784.00 | [config](vit-giant-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/dinov2). The config files of these models are only for inference. We haven't reproduced the training results.*
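+
+Since these checkpoints are headless, a common pattern is to reuse them only to initialize a backbone elsewhere. Below is a hypothetical config sketch, not an official recipe: it reuses the ViT-base checkpoint URL from the table above and assumes the standard MMEngine `Pretrained` initializer with `prefix='backbone.'` to pick the backbone weights out of the full checkpoint.
+
+```python
+model = dict(
+    type='ImageClassifier',
+    backbone=dict(
+        type='VisionTransformer',
+        arch='base',
+        img_size=518,
+        patch_size=14,
+        layer_scale_init_value=1e-5,
+        # Load only the `backbone.*` weights from the converted checkpoint.
+        init_cfg=dict(
+            type='Pretrained',
+            checkpoint='https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth',
+            prefix='backbone.')),
+    neck=None,
+    head=None)
+```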
+
+## Citation
+
+```bibtex
+@misc{oquab2023dinov2,
+ title={DINOv2: Learning Robust Visual Features without Supervision},
+ author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
+ journal={arXiv:2304.07193},
+ year={2023}
+}
+```
diff --git a/configs/dinov2/metafile.yml b/configs/dinov2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..48f205a24abf006019fa00041bfc8cb5a138aa55
--- /dev/null
+++ b/configs/dinov2/metafile.yml
@@ -0,0 +1,73 @@
+Collections:
+ - Name: DINOv2
+ Metadata:
+ Architecture:
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ Paper:
+ Title: 'DINOv2: Learning Robust Visual Features without Supervision'
+ URL: https://arxiv.org/abs/2304.07193
+ README: configs/dinov2/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: vit-small-p14_dinov2-pre_3rdparty
+ Metadata:
+ FLOPs: 46762000000
+ Parameters: 22056000
+ Training Data:
+ - LVD-142M
+ In Collection: DINOv2
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth
+ Config: configs/dinov2/vit-small-p14_dinov2-pre_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_pretrain.pth
+ Code: https://github.com/facebookresearch/dinov2
+
+ - Name: vit-base-p14_dinov2-pre_3rdparty
+ Metadata:
+ FLOPs: 152000000000
+ Parameters: 86580000
+ Training Data:
+ - LVD-142M
+ In Collection: DINOv2
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth
+ Config: configs/dinov2/vit-base-p14_dinov2-pre_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_pretrain.pth
+ Code: https://github.com/facebookresearch/dinov2
+
+ - Name: vit-large-p14_dinov2-pre_3rdparty
+ Metadata:
+ FLOPs: 507000000000
+ Parameters: 304000000
+ Training Data:
+ - LVD-142M
+ In Collection: DINOv2
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth
+ Config: configs/dinov2/vit-large-p14_dinov2-pre_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth
+ Code: https://github.com/facebookresearch/dinov2
+
+ - Name: vit-giant-p14_dinov2-pre_3rdparty
+ Metadata:
+ FLOPs: 1784000000000
+ Parameters: 1136000000
+ Training Data:
+ - LVD-142M
+ In Collection: DINOv2
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth
+ Config: configs/dinov2/vit-giant-p14_dinov2-pre_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth
+ Code: https://github.com/facebookresearch/dinov2
diff --git a/configs/dinov2/vit-base-p14_dinov2-pre_headless.py b/configs/dinov2/vit-base-p14_dinov2-pre_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..524dfe30bf47db1614d203097ffcfeeec5f68c1a
--- /dev/null
+++ b/configs/dinov2/vit-base-p14_dinov2-pre_headless.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=518,
+ patch_size=14,
+ layer_scale_init_value=1e-5,
+ ),
+ neck=None,
+ head=None)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/dinov2/vit-giant-p14_dinov2-pre_headless.py b/configs/dinov2/vit-giant-p14_dinov2-pre_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..a127359e5c44b6fa99482c3720cc1555432af699
--- /dev/null
+++ b/configs/dinov2/vit-giant-p14_dinov2-pre_headless.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='dinov2-giant',
+ img_size=518,
+ patch_size=14,
+ layer_scale_init_value=1e-5,
+ layer_cfgs=dict(ffn_type='swiglu_fused'),
+ ),
+ neck=None,
+ head=None)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/dinov2/vit-large-p14_dinov2-pre_headless.py b/configs/dinov2/vit-large-p14_dinov2-pre_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ec7bc68455520bef8986a8d563e5c732f3bf994
--- /dev/null
+++ b/configs/dinov2/vit-large-p14_dinov2-pre_headless.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=518,
+ patch_size=14,
+ layer_scale_init_value=1e-5,
+ ),
+ neck=None,
+ head=None)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/dinov2/vit-small-p14_dinov2-pre_headless.py b/configs/dinov2/vit-small-p14_dinov2-pre_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..198c5e51ab29be9202ac053c082366ec818e3982
--- /dev/null
+++ b/configs/dinov2/vit-small-p14_dinov2-pre_headless.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='dinov2-small',
+ img_size=518,
+ patch_size=14,
+ layer_scale_init_value=1e-5,
+ ),
+ neck=None,
+ head=None)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/edgenext/README.md b/configs/edgenext/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1c9686f7d96183feb115f2bb6860688e48440ed8
--- /dev/null
+++ b/configs/edgenext/README.md
@@ -0,0 +1,80 @@
+# EdgeNeXt
+
+> [EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications](https://arxiv.org/abs/2206.10589)
+
+
+
+## Abstract
+
+In the pursuit of achieving ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient general purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture EdgeNeXt. Specifically in EdgeNeXt, we introduce split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and utilizes depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection and segmentation tasks, reveal the merits of the proposed approach, outperforming state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% with 28% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('edgenext-xxsmall_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('edgenext-xxsmall_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
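+
+Continuing from the snippet above, `out` is the raw output of the classification head. A small sketch, assuming it is a logits tensor of shape `(1, 1000)` as is usual for these ImageNet checkpoints, to turn it into a readable prediction:
+
+```python
+# Convert raw logits to probabilities and take the top-1 class index.
+probs = torch.softmax(out, dim=1)
+score, label = probs.max(dim=1)
+print(label.item(), score.item())
+```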
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/edgenext/edgenext-xxsmall_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :----------------------------------------------------------------------: |
+| `edgenext-xxsmall_3rdparty_in1k`\* | From scratch | 1.33 | 0.26 | 71.20 | 89.91 | [config](edgenext-xxsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth) |
+| `edgenext-xsmall_3rdparty_in1k`\* | From scratch | 2.34 | 0.53 | 74.86 | 92.31 | [config](edgenext-xsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xsmall_3rdparty_in1k_20220801-974f9fe7.pth) |
+| `edgenext-small_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 79.41 | 94.53 | [config](edgenext-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty_in1k_20220801-d00db5f8.pth) |
+| `edgenext-small-usi_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 81.06 | 95.34 | [config](edgenext-small_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty-usi_in1k_20220801-ae6d8dd3.pth) |
+| `edgenext-base_3rdparty_in1k`\* | From scratch | 18.51 | 3.81 | 82.48 | 96.20 | [config](edgenext-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty_in1k_20220801-9ade408b.pth) |
+| `edgenext-base_3rdparty-usi_in1k`\* | From scratch | 18.51 | 3.81 | 83.67 | 96.70 | [config](edgenext-base_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty-usi_in1k_20220801-909e8939.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/mmaaz60/EdgeNeXt). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{Maaz2022EdgeNeXt,
+ title={EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications},
+ author={Muhammad Maaz and Abdelrahman Shaker and Hisham Cholakkal and Salman Khan and Syed Waqas Zamir and Rao Muhammad Anwer and Fahad Shahbaz Khan},
+ journal={arXiv:2206.10589},
+ year={2022}
+}
+```
diff --git a/configs/edgenext/edgenext-base_8xb256-usi_in1k.py b/configs/edgenext/edgenext-base_8xb256-usi_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..13949deaed9b09f7473fca60d4bab2012ce00c48
--- /dev/null
+++ b/configs/edgenext/edgenext-base_8xb256-usi_in1k.py
@@ -0,0 +1,19 @@
+_base_ = ['./edgenext-base_8xb256_in1k.py']
+
+# dataset setting
+
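+# Resize the short edge to 269 (about 256 / 0.95, i.e. a 0.95 center-crop
+# ratio at test resolution 256), then center-crop to 256; this presumably
+# mirrors the evaluation setting of the original USI checkpoint.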
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=269,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs')
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+test_dataloader = val_dataloader
diff --git a/configs/edgenext/edgenext-base_8xb256_in1k.py b/configs/edgenext/edgenext-base_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d0a75c62fe0c771e65541937ca32b9b7ca3e9e0
--- /dev/null
+++ b/configs/edgenext/edgenext-base_8xb256_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../_base_/models/edgenext/edgenext-base.py',
+ '../_base_/datasets/imagenet_bs64_edgenext_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=6e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
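+# For this config's intended setup (8 GPUs x 256 samples per GPU = 2048 in
+# total), enabling `--auto-scale-lr` in the train script would scale the LR
+# linearly by 2048 / 4096 = 0.5.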
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/edgenext/edgenext-small_8xb256-usi_in1k.py b/configs/edgenext/edgenext-small_8xb256-usi_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d6bc904be7f7e82eb3b9769260dd3559ee33e45f
--- /dev/null
+++ b/configs/edgenext/edgenext-small_8xb256-usi_in1k.py
@@ -0,0 +1,19 @@
+_base_ = ['./edgenext-small_8xb256_in1k.py']
+
+# dataset setting
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=269,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs')
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+test_dataloader = val_dataloader
diff --git a/configs/edgenext/edgenext-small_8xb256_in1k.py b/configs/edgenext/edgenext-small_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1d99bdc9f6958037306c98ba863ffb8743fa347
--- /dev/null
+++ b/configs/edgenext/edgenext-small_8xb256_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../_base_/models/edgenext/edgenext-small.py',
+ '../_base_/datasets/imagenet_bs64_edgenext_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=6e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/edgenext/edgenext-xsmall_8xb256_in1k.py b/configs/edgenext/edgenext-xsmall_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9d2326fc9deda56d1366a4ec9cafff4e4740c24c
--- /dev/null
+++ b/configs/edgenext/edgenext-xsmall_8xb256_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../_base_/models/edgenext/edgenext-xsmall.py',
+ '../_base_/datasets/imagenet_bs64_edgenext_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=6e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/edgenext/edgenext-xxsmall_8xb256_in1k.py b/configs/edgenext/edgenext-xxsmall_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..507c3cb598fab10416d621e0e4cf4f78114a7918
--- /dev/null
+++ b/configs/edgenext/edgenext-xxsmall_8xb256_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../_base_/models/edgenext/edgenext-xxsmall.py',
+ '../_base_/datasets/imagenet_bs64_edgenext_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=6e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/edgenext/metafile.yml b/configs/edgenext/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..e69ac17405ea5081c515e8a48ff550e09675e867
--- /dev/null
+++ b/configs/edgenext/metafile.yml
@@ -0,0 +1,118 @@
+Collections:
+ - Name: EdgeNeXt
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - SDTA
+ - 1x1 Convolution
+ - Channel Self-attention
+ Paper:
+ URL: https://arxiv.org/abs/2206.10589
+ Title: 'EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications'
+ README: configs/edgenext/README.md
+ Code:
+ Version: v1.0.0rc1
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.23.2/mmcls/models/backbones/edgenext.py
+
+Models:
+ - Name: edgenext-xxsmall_3rdparty_in1k
+ Metadata:
+ FLOPs: 255640144
+ Parameters: 1327216
+ In Collection: EdgeNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 71.20
+ Top 5 Accuracy: 89.91
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth
+ Config: configs/edgenext/edgenext-xxsmall_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_xxsmall.pth
+ Code: https://github.com/mmaaz60/EdgeNeXt
+ - Name: edgenext-xsmall_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 529970560
+ Parameters: 2336804
+ In Collection: EdgeNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.86
+ Top 5 Accuracy: 92.31
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xsmall_3rdparty_in1k_20220801-974f9fe7.pth
+ Config: configs/edgenext/edgenext-xsmall_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_xsmall.pth
+ Code: https://github.com/mmaaz60/EdgeNeXt
+ - Name: edgenext-small_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 1249788000
+ Parameters: 5586832
+ In Collection: EdgeNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.41
+ Top 5 Accuracy: 94.53
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty_in1k_20220801-d00db5f8.pth
+ Config: configs/edgenext/edgenext-small_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_small.pth
+ Code: https://github.com/mmaaz60/EdgeNeXt
+ - Name: edgenext-small-usi_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 1249788000
+ Parameters: 5586832
+ In Collection: EdgeNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.06
+ Top 5 Accuracy: 95.34
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty-usi_in1k_20220801-ae6d8dd3.pth
+ Config: configs/edgenext/edgenext-small_8xb256-usi_in1k.py
+ Converted From:
+ Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.1/edgenext_small_usi.pth
+ Code: https://github.com/mmaaz60/EdgeNeXt
+ - Name: edgenext-base_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 3814395280
+ Parameters: 18511292
+ In Collection: EdgeNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.48
+ Top 5 Accuracy: 96.2
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty_in1k_20220801-9ade408b.pth
+ Config: configs/edgenext/edgenext-base_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.2/edgenext_base.pth
+ Code: https://github.com/mmaaz60/EdgeNeXt
+ - Name: edgenext-base_3rdparty-usi_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 3814395280
+ Parameters: 18511292
+ In Collection: EdgeNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.67
+ Top 5 Accuracy: 96.7
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty-usi_in1k_20220801-909e8939.pth
+ Config: configs/edgenext/edgenext-base_8xb256-usi_in1k.py
+ Converted From:
+ Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.2/edgenext_base_usi.pth
+ Code: https://github.com/mmaaz60/EdgeNeXt
diff --git a/configs/efficientformer/README.md b/configs/efficientformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..537777efc0da6cba6aa198ab204945a1c3712688
--- /dev/null
+++ b/configs/efficientformer/README.md
@@ -0,0 +1,88 @@
+# EfficientFormer
+
+> [EfficientFormer: Vision Transformers at MobileNet Speed](https://arxiv.org/abs/2206.01191)
+
+
+
+## Abstract
+
+Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('efficientformer-l1_3rdparty_8xb128_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
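+
+`inference_model` is convenient for a one-off prediction, but it builds a new model on every call. For repeated or batched inference, a sketch using the inferencer class is shown below; it assumes the `ImageClassificationInferencer` helper exported by mmpretrain and uses example image paths, so adjust both to your setup:
+
+```python
+from mmpretrain import ImageClassificationInferencer
+
+# Build the model once, then reuse it for many images.
+inferencer = ImageClassificationInferencer('efficientformer-l1_3rdparty_8xb128_in1k', pretrained=True)
+results = inferencer(['demo/bird.JPEG', 'demo/demo.JPEG'])
+print([r['pred_class'] for r in results])
+```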
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('efficientformer-l1_3rdparty_8xb128_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/efficientformer/efficientformer-l1_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :---------------------------------------------------------------: |
+| `efficientformer-l1_3rdparty_8xb128_in1k`\* | From scratch | 12.28 | 1.30 | 80.46 | 94.99 | [config](efficientformer-l1_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth) |
+| `efficientformer-l3_3rdparty_8xb128_in1k`\* | From scratch | 31.41 | 3.74 | 82.45 | 96.18 | [config](efficientformer-l3_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l3_3rdparty_in1k_20220915-466793d6.pth) |
+| `efficientformer-l7_3rdparty_8xb128_in1k`\* | From scratch | 82.23 | 10.16 | 83.40 | 96.60 | [config](efficientformer-l7_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l7_3rdparty_in1k_20220915-185e30af.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/snap-research/EfficientFormer). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@misc{https://doi.org/10.48550/arxiv.2206.01191,
+  doi = {10.48550/ARXIV.2206.01191},
+  url = {https://arxiv.org/abs/2206.01191},
+  author = {Li, Yanyu and Yuan, Geng and Wen, Yang and Hu, Eric and Evangelidis, Georgios and Tulyakov, Sergey and Wang, Yanzhi and Ren, Jian},
+  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
+  title = {EfficientFormer: Vision Transformers at MobileNet Speed},
+  publisher = {arXiv},
+  year = {2022},
+  copyright = {Creative Commons Attribution 4.0 International}
+}
+```
diff --git a/configs/efficientformer/efficientformer-l1_8xb128_in1k.py b/configs/efficientformer/efficientformer-l1_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7f55dc653eccad42dcf95d60f9aab86460ca9117
--- /dev/null
+++ b/configs/efficientformer/efficientformer-l1_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/efficientformer-l1.py',
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/efficientformer/efficientformer-l3_8xb128_in1k.py b/configs/efficientformer/efficientformer-l3_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d8be5efae1ad93f175c25eabc6361a20c1ece76f
--- /dev/null
+++ b/configs/efficientformer/efficientformer-l3_8xb128_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './efficientformer-l1_8xb128_in1k.py'
+
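+# The head's in_channels must match the output channel width of the chosen
+# arch (512 for 'l3').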
+model = dict(backbone=dict(arch='l3'), head=dict(in_channels=512))
diff --git a/configs/efficientformer/efficientformer-l7_8xb128_in1k.py b/configs/efficientformer/efficientformer-l7_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c2252652efe55840880ad64cde121a51614f4e84
--- /dev/null
+++ b/configs/efficientformer/efficientformer-l7_8xb128_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './efficientformer-l1_8xb128_in1k.py'
+
+model = dict(backbone=dict(arch='l7'), head=dict(in_channels=768))
diff --git a/configs/efficientformer/metafile.yml b/configs/efficientformer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..5c70f07ec52f956e0644d4e25d4162ed009ac72a
--- /dev/null
+++ b/configs/efficientformer/metafile.yml
@@ -0,0 +1,67 @@
+Collections:
+ - Name: EfficientFormer
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Pooling
+ - 1x1 Convolution
+ - LayerScale
+ - MetaFormer
+ Paper:
+ URL: https://arxiv.org/abs/2206.01191
+ Title: "EfficientFormer: Vision Transformers at MobileNet Speed"
+ README: configs/efficientformer/README.md
+ Code:
+ Version: v1.0.0rc1
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc1/configs/efficientformer/metafile.yml
+
+Models:
+ - Name: efficientformer-l1_3rdparty_8xb128_in1k
+ Metadata:
+ FLOPs: 1304601088 # 1.3G
+ Parameters: 12278696 # 12M
+ In Collection: EfficientFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.46
+ Top 5 Accuracy: 94.99
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth
+ Config: configs/efficientformer/efficientformer-l1_8xb128_in1k.py
+ Converted From:
+ Weights: https://drive.google.com/file/d/11SbX-3cfqTOc247xKYubrAjBiUmr818y/view?usp=sharing
+ Code: https://github.com/snap-research/EfficientFormer
+ - Name: efficientformer-l3_3rdparty_8xb128_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 3737045760 # 3.7G
+ Parameters: 31406000 # 31M
+ In Collection: EfficientFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.45
+ Top 5 Accuracy: 96.18
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l3_3rdparty_in1k_20220915-466793d6.pth
+ Config: configs/efficientformer/efficientformer-l3_8xb128_in1k.py
+ Converted From:
+ Weights: https://drive.google.com/file/d/1OyyjKKxDyMj-BcfInp4GlDdwLu3hc30m/view?usp=sharing
+ Code: https://github.com/snap-research/EfficientFormer
+ - Name: efficientformer-l7_3rdparty_8xb128_in1k
+ Metadata:
+ FLOPs: 10163951616 # 10.2G
+ Parameters: 82229328 # 82M
+ In Collection: EfficientFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.40
+ Top 5 Accuracy: 96.60
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l7_3rdparty_in1k_20220915-185e30af.pth
+ Config: configs/efficientformer/efficientformer-l7_8xb128_in1k.py
+ Converted From:
+ Weights: https://drive.google.com/file/d/1cVw-pctJwgvGafeouynqWWCwgkcoFMM5/view?usp=sharing
+ Code: https://github.com/snap-research/EfficientFormer
diff --git a/configs/efficientnet/README.md b/configs/efficientnet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..c7b7b76ab5db29c3f9bc54eaefffdcf9cda4c13a
--- /dev/null
+++ b/configs/efficientnet/README.md
@@ -0,0 +1,122 @@
+# EfficientNet
+
+> [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946v5)
+
+
+
+## Introduction
+
+EfficientNets are a family of image classification models that achieve state-of-the-art accuracy while being an order of magnitude smaller and faster than previous models.
+
+EfficientNets are based on AutoML and Compound Scaling. In particular, we first use the [AutoML MNAS Mobile framework](https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html) to develop a mobile-size baseline network, named EfficientNet-B0; then, we use the compound scaling method to scale up this baseline to obtain EfficientNet-B1 through B7.
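+
+Concretely, compound scaling ties depth, width and input resolution to a single coefficient φ: depth scales as α^φ, width as β^φ and resolution as γ^φ, with α·β²·γ² ≈ 2 so that FLOPs roughly double per unit of φ (the paper's grid search gives α=1.2, β=1.1, γ=1.15 for the B0 baseline). The sketch below only illustrates this arithmetic; the released B1 to B7 models round these multipliers to hand-picked values, so it is not the reference implementation.
+
+```python
+# Illustrative compound-scaling arithmetic from the EfficientNet paper.
+alpha, beta, gamma = 1.2, 1.1, 1.15  # depth / width / resolution factors for B0
+
+def compound_scale(phi):
+    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
+    return alpha ** phi, beta ** phi, gamma ** phi
+
+for phi in range(4):
+    d, w, r = compound_scale(phi)
+    print(f'phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}')
+```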
+
+
+
+
+
+## Abstract
+
+
+
+
+
+Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('efficientnet-b0_3rdparty_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('efficientnet-b0_3rdparty_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/efficientnet/efficientnet-b0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32_in1k_20220119-a7e2a0b1.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------------: | :----------------------------------------------------: |
+| `efficientnet-b0_3rdparty_8xb32_in1k`\* | From scratch | 5.29 | 0.42 | 76.74 | 93.17 | [config](efficientnet-b0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32_in1k_20220119-a7e2a0b1.pth) |
+| `efficientnet-b0_3rdparty_8xb32-aa_in1k`\* | From scratch | 5.29 | 0.42 | 77.26 | 93.41 | [config](efficientnet-b0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32-aa_in1k_20220119-8d939117.pth) |
+| `efficientnet-b0_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 5.29 | 0.42 | 77.53 | 93.61 | [config](efficientnet-b0_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32-aa-advprop_in1k_20220119-26434485.pth) |
+| `efficientnet-b0_3rdparty-ra-noisystudent_in1k`\* | From scratch | 5.29 | 0.42 | 77.63 | 94.00 | [config](efficientnet-b0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty-ra-noisystudent_in1k_20221103-75cd08d3.pth) |
+| `efficientnet-b1_3rdparty_8xb32_in1k`\* | From scratch | 7.79 | 0.74 | 78.68 | 94.28 | [config](efficientnet-b1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32_in1k_20220119-002556d9.pth) |
+| `efficientnet-b1_3rdparty_8xb32-aa_in1k`\* | From scratch | 7.79 | 0.74 | 79.20 | 94.42 | [config](efficientnet-b1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32-aa_in1k_20220119-619d8ae3.pth) |
+| `efficientnet-b1_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 7.79 | 0.74 | 79.52 | 94.43 | [config](efficientnet-b1_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32-aa-advprop_in1k_20220119-5715267d.pth) |
+| `efficientnet-b1_3rdparty-ra-noisystudent_in1k`\* | From scratch | 7.79 | 0.74 | 81.44 | 95.83 | [config](efficientnet-b1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty-ra-noisystudent_in1k_20221103-756bcbc0.pth) |
+| `efficientnet-b2_3rdparty_8xb32_in1k`\* | From scratch | 9.11 | 1.07 | 79.64 | 94.80 | [config](efficientnet-b2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32_in1k_20220119-ea374a30.pth) |
+| `efficientnet-b2_3rdparty_8xb32-aa_in1k`\* | From scratch | 9.11 | 1.07 | 80.21 | 94.96 | [config](efficientnet-b2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32-aa_in1k_20220119-dd61e80b.pth) |
+| `efficientnet-b2_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 9.11 | 1.07 | 80.45 | 95.07 | [config](efficientnet-b2_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32-aa-advprop_in1k_20220119-1655338a.pth) |
+| `efficientnet-b2_3rdparty-ra-noisystudent_in1k`\* | From scratch | 9.11 | 1.07 | 82.47 | 96.23 | [config](efficientnet-b2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty-ra-noisystudent_in1k_20221103-301ed299.pth) |
+| `efficientnet-b3_3rdparty_8xb32_in1k`\* | From scratch | 12.23 | 1.95 | 81.01 | 95.34 | [config](efficientnet-b3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32_in1k_20220119-4b4d7487.pth) |
+| `efficientnet-b3_3rdparty_8xb32-aa_in1k`\* | From scratch | 12.23 | 1.95 | 81.58 | 95.67 | [config](efficientnet-b3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32-aa_in1k_20220119-5b4887a0.pth) |
+| `efficientnet-b3_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 12.23 | 1.95 | 81.81 | 95.69 | [config](efficientnet-b3_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32-aa-advprop_in1k_20220119-53b41118.pth) |
+| `efficientnet-b3_3rdparty-ra-noisystudent_in1k`\* | From scratch | 12.23 | 1.95 | 84.02 | 96.89 | [config](efficientnet-b3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty-ra-noisystudent_in1k_20221103-a4ab5fd6.pth) |
+| `efficientnet-b4_3rdparty_8xb32_in1k`\* | From scratch | 19.34 | 4.66 | 82.57 | 96.09 | [config](efficientnet-b4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32_in1k_20220119-81fd4077.pth) |
+| `efficientnet-b4_3rdparty_8xb32-aa_in1k`\* | From scratch | 19.34 | 4.66 | 82.95 | 96.26 | [config](efficientnet-b4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32-aa_in1k_20220119-45b8bd2b.pth) |
+| `efficientnet-b4_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 19.34 | 4.66 | 83.25 | 96.44 | [config](efficientnet-b4_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32-aa-advprop_in1k_20220119-38c2238c.pth) |
+| `efficientnet-b4_3rdparty-ra-noisystudent_in1k`\* | From scratch | 19.34 | 4.66 | 85.25 | 97.52 | [config](efficientnet-b4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty-ra-noisystudent_in1k_20221103-16ba8a2d.pth) |
+| `efficientnet-b5_3rdparty_8xb32_in1k`\* | From scratch | 30.39 | 10.80 | 83.18 | 96.47 | [config](efficientnet-b5_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32_in1k_20220119-e9814430.pth) |
+| `efficientnet-b5_3rdparty_8xb32-aa_in1k`\* | From scratch | 30.39 | 10.80 | 83.82 | 96.76 | [config](efficientnet-b5_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32-aa_in1k_20220119-2cab8b78.pth) |
+| `efficientnet-b5_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 30.39 | 10.80 | 84.21 | 96.98 | [config](efficientnet-b5_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32-aa-advprop_in1k_20220119-f57a895a.pth) |
+| `efficientnet-b5_3rdparty-ra-noisystudent_in1k`\* | From scratch | 30.39 | 10.80 | 86.08 | 97.75 | [config](efficientnet-b5_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty-ra-noisystudent_in1k_20221103-111a185f.pth) |
+| `efficientnet-b6_3rdparty_8xb32-aa_in1k`\* | From scratch | 43.04 | 19.97 | 84.05 | 96.82 | [config](efficientnet-b6_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty_8xb32-aa_in1k_20220119-45b03310.pth) |
+| `efficientnet-b6_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 43.04 | 19.97 | 84.74 | 97.14 | [config](efficientnet-b6_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty_8xb32-aa-advprop_in1k_20220119-bfe3485e.pth) |
+| `efficientnet-b6_3rdparty-ra-noisystudent_in1k`\* | From scratch | 43.04 | 19.97 | 86.47 | 97.87 | [config](efficientnet-b6_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty-ra-noisystudent_in1k_20221103-7de7d2cc.pth) |
+| `efficientnet-b7_3rdparty_8xb32-aa_in1k`\* | From scratch | 66.35 | 39.32 | 84.38 | 96.88 | [config](efficientnet-b7_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty_8xb32-aa_in1k_20220119-bf03951c.pth) |
+| `efficientnet-b7_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 66.35 | 39.32 | 85.14 | 97.23 | [config](efficientnet-b7_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty_8xb32-aa-advprop_in1k_20220119-c6dbff10.pth) |
+| `efficientnet-b7_3rdparty-ra-noisystudent_in1k`\* | From scratch | 66.35 | 39.32 | 86.83 | 98.08 | [config](efficientnet-b7_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty-ra-noisystudent_in1k_20221103-a82894bc.pth) |
+| `efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 87.41 | 65.00 | 85.38 | 97.28 | [config](efficientnet-b8_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k_20220119-297ce1b7.pth) |
+| `efficientnet-l2_3rdparty-ra-noisystudent_in1k-800px`\* | From scratch | 480.31 | 174.20 | 88.33 | 98.65 | [config](efficientnet-l2_8xb8_in1k-800px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-l2_3rdparty-ra-noisystudent_in1k_20221103-be73be13.pth) |
+| `efficientnet-l2_3rdparty-ra-noisystudent_in1k-475px`\* | From scratch | 480.31 | 484.98 | 88.18 | 98.55 | [config](efficientnet-l2_8xb32_in1k-475px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-l2_3rdparty-ra-noisystudent_in1k-475px_20221103-5a0d8058.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{tan2019efficientnet,
+ title={Efficientnet: Rethinking model scaling for convolutional neural networks},
+ author={Tan, Mingxing and Le, Quoc},
+ booktitle={International Conference on Machine Learning},
+ pages={6105--6114},
+ year={2019},
+ organization={PMLR}
+}
+```
diff --git a/configs/efficientnet/efficientnet-b0_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b0_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..369d0a43d1950de5da47789d0f28465c95fdaae5
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b0_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b0.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
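+ # i.e. map pixel values from [0, 255] to roughly [-1, 1] instead of using
+ # the default ImageNet mean/std statistics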
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b0_8xb32_in1k.py b/configs/efficientnet/efficientnet-b0_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e4263da196430b310fae4da3273d13bb66e89075
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b0_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b0.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b1_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b1_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0405cf5f84eeedf0a2e761670bc600d9f82401af
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b1_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b1.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=240),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=240),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b1_8xb32_in1k.py b/configs/efficientnet/efficientnet-b1_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e5bf2e8076d81c97adb4d1883cfbdb5f645b6b93
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b1_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b1.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=240),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=240),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b2_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b2_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..da3f23b84c6f7fc8b5d415b90ca2f69f4d6e58c4
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b2_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b2.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=260),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=260),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b2_8xb32_in1k.py b/configs/efficientnet/efficientnet-b2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..060a2ad3ea9247131c4207d738dce0bfacd16a16
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b2_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b2.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=260),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=260),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b3_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b3_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..55729a9c2258352a6ed981dff25777b0acaaae85
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b3_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b3.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=300),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=300),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b3_8xb32_in1k.py b/configs/efficientnet/efficientnet-b3_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d84de5a79316ab6d7f73e45f266fbaec43ed9629
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b3_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b3.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=300),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=300),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b4_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b4_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a4dbfb212fd03d508b678a684f4d8b6854f648c6
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b4_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b4.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=380),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=380),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b4_8xb32_in1k.py b/configs/efficientnet/efficientnet-b4_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..08e246c3851d12ee067469d9afb10fc7f0933de7
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b4_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b4.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=380),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=380),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b5_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b5_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0c646da43d4baf23cebfc6835ec400dba6d5bd35
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b5_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b5.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=456),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=456),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b5_8xb32_in1k.py b/configs/efficientnet/efficientnet-b5_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..af4fa4b8fcbce99ae1ac163c72cec11789109482
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b5_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b5.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=456),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=456),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b6_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b6_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..dd15054928b56bdae2c3a2ef479e96826824fe2b
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b6_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b6.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=528),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=528),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b6_8xb32_in1k.py b/configs/efficientnet/efficientnet-b6_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fae02aed6dd5b8fbb1b42140856333b771c927d1
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b6_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b6.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=528),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=528),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b7_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b7_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..687dfd261d73d84061b289c955cb0260059999b2
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b7_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b7.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=600),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=600),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b7_8xb32_in1k.py b/configs/efficientnet/efficientnet-b7_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d783bb30bf1939aa1c8c9a010e5733ae7b1342b
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b7_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b7.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=600),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=600),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..07d3692baa9b9f3d10109e63d1da5e74cc62ee26
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b8.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=672),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=672),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b8_8xb32_in1k.py b/configs/efficientnet/efficientnet-b8_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..868986f52488233b36631c13d66d8da2aac8c348
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b8_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b8.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=672),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=672),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-em_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-em_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9de3b27fb31a1382c08a646987b7cf4d996e77f4
--- /dev/null
+++ b/configs/efficientnet/efficientnet-em_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_em.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=240),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=240),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-es_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-es_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e643d55089b932732d47c5dbe5734c2085a2fb3e
--- /dev/null
+++ b/configs/efficientnet/efficientnet-es_8xb32-01norm_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_es.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+    mean=[127.5, 127.5, 127.5],
+    std=[127.5, 127.5, 127.5],
+    # convert image from BGR to RGB
+    to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-l2_8xb32_in1k-475px.py b/configs/efficientnet/efficientnet-l2_8xb32_in1k-475px.py
new file mode 100644
index 0000000000000000000000000000000000000000..560695144f50194c00bc78707c8ddf7288e4cd52
--- /dev/null
+++ b/configs/efficientnet/efficientnet-l2_8xb32_in1k-475px.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_l2.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=475),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=475),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-l2_8xb8_in1k-800px.py b/configs/efficientnet/efficientnet-l2_8xb8_in1k-800px.py
new file mode 100644
index 0000000000000000000000000000000000000000..61bddfa735117db68377a224f72c1160a046ae1c
--- /dev/null
+++ b/configs/efficientnet/efficientnet-l2_8xb8_in1k-800px.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_l2.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=800),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=800),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=8, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=8, dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(batch_size=8, dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/metafile.yml b/configs/efficientnet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..21130c4ff1d64895295372acac18961a4f90bd7c
--- /dev/null
+++ b/configs/efficientnet/metafile.yml
@@ -0,0 +1,551 @@
+Collections:
+ - Name: EfficientNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - 1x1 Convolution
+ - Average Pooling
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - Inverted Residual Block
+ - RMSProp
+ - Squeeze-and-Excitation Block
+ - Swish
+ Paper:
+ URL: https://arxiv.org/abs/1905.11946v5
+ Title: "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks"
+ README: configs/efficientnet/README.md
+ Code:
+ Version: v0.20.1
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/efficientnet.py
+
+Models:
+ - Name: efficientnet-b0_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 420592480
+ Parameters: 5288548
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.74
+ Top 5 Accuracy: 93.17
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32_in1k_20220119-a7e2a0b1.pth
+ Config: configs/efficientnet/efficientnet-b0_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b0.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b0_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 420592480
+ Parameters: 5288548
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.26
+ Top 5 Accuracy: 93.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32-aa_in1k_20220119-8d939117.pth
+ Config: configs/efficientnet/efficientnet-b0_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b0.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b0_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 420592480
+ Parameters: 5288548
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.53
+ Top 5 Accuracy: 93.61
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32-aa-advprop_in1k_20220119-26434485.pth
+ Config: configs/efficientnet/efficientnet-b0_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b0.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b0_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 420592480
+ Parameters: 5288548
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.63
+ Top 5 Accuracy: 94.00
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty-ra-noisystudent_in1k_20221103-75cd08d3.pth
+ Config: configs/efficientnet/efficientnet-b0_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b0.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b1_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 744059920
+ Parameters: 7794184
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.68
+ Top 5 Accuracy: 94.28
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32_in1k_20220119-002556d9.pth
+ Config: configs/efficientnet/efficientnet-b1_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b1.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b1_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 744059920
+ Parameters: 7794184
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.20
+ Top 5 Accuracy: 94.42
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32-aa_in1k_20220119-619d8ae3.pth
+ Config: configs/efficientnet/efficientnet-b1_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b1.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b1_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 744059920
+ Parameters: 7794184
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.52
+ Top 5 Accuracy: 94.43
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32-aa-advprop_in1k_20220119-5715267d.pth
+ Config: configs/efficientnet/efficientnet-b1_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b1.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b1_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 744059920
+ Parameters: 7794184
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.44
+ Top 5 Accuracy: 95.83
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty-ra-noisystudent_in1k_20221103-756bcbc0.pth
+ Config: configs/efficientnet/efficientnet-b1_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b1.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b2_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 1066620392
+ Parameters: 9109994
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.64
+ Top 5 Accuracy: 94.80
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32_in1k_20220119-ea374a30.pth
+ Config: configs/efficientnet/efficientnet-b2_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b2.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b2_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 1066620392
+ Parameters: 9109994
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.21
+ Top 5 Accuracy: 94.96
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32-aa_in1k_20220119-dd61e80b.pth
+ Config: configs/efficientnet/efficientnet-b2_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b2.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b2_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 1066620392
+ Parameters: 9109994
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.45
+ Top 5 Accuracy: 95.07
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32-aa-advprop_in1k_20220119-1655338a.pth
+ Config: configs/efficientnet/efficientnet-b2_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b2.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b2_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 1066620392
+ Parameters: 9109994
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.47
+ Top 5 Accuracy: 96.23
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty-ra-noisystudent_in1k_20221103-301ed299.pth
+ Config: configs/efficientnet/efficientnet-b2_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b2.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b3_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 1953798216
+ Parameters: 12233232
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.01
+ Top 5 Accuracy: 95.34
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32_in1k_20220119-4b4d7487.pth
+ Config: configs/efficientnet/efficientnet-b3_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b3.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b3_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 1953798216
+ Parameters: 12233232
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.58
+ Top 5 Accuracy: 95.67
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32-aa_in1k_20220119-5b4887a0.pth
+ Config: configs/efficientnet/efficientnet-b3_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b3.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b3_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 1953798216
+ Parameters: 12233232
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.81
+ Top 5 Accuracy: 95.69
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32-aa-advprop_in1k_20220119-53b41118.pth
+ Config: configs/efficientnet/efficientnet-b3_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b3.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b3_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 1953798216
+ Parameters: 12233232
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.02
+ Top 5 Accuracy: 96.89
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty-ra-noisystudent_in1k_20221103-a4ab5fd6.pth
+ Config: configs/efficientnet/efficientnet-b3_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b3.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b4_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 4659080176
+ Parameters: 19341616
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.57
+ Top 5 Accuracy: 96.09
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32_in1k_20220119-81fd4077.pth
+ Config: configs/efficientnet/efficientnet-b4_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b4.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b4_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 4659080176
+ Parameters: 19341616
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.95
+ Top 5 Accuracy: 96.26
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32-aa_in1k_20220119-45b8bd2b.pth
+ Config: configs/efficientnet/efficientnet-b4_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b4.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b4_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 4659080176
+ Parameters: 19341616
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.25
+ Top 5 Accuracy: 96.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32-aa-advprop_in1k_20220119-38c2238c.pth
+ Config: configs/efficientnet/efficientnet-b4_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b4.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b4_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 4659080176
+ Parameters: 19341616
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.25
+ Top 5 Accuracy: 97.52
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty-ra-noisystudent_in1k_20221103-16ba8a2d.pth
+ Config: configs/efficientnet/efficientnet-b4_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b4.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b5_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 10799472560
+ Parameters: 30389784
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.18
+ Top 5 Accuracy: 96.47
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32_in1k_20220119-e9814430.pth
+ Config: configs/efficientnet/efficientnet-b5_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b5.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b5_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 10799472560
+ Parameters: 30389784
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.82
+ Top 5 Accuracy: 96.76
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32-aa_in1k_20220119-2cab8b78.pth
+ Config: configs/efficientnet/efficientnet-b5_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b5.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b5_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 10799472560
+ Parameters: 30389784
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.21
+ Top 5 Accuracy: 96.98
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32-aa-advprop_in1k_20220119-f57a895a.pth
+ Config: configs/efficientnet/efficientnet-b5_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b5.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b5_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 10799472560
+ Parameters: 30389784
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.08
+ Top 5 Accuracy: 97.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty-ra-noisystudent_in1k_20221103-111a185f.pth
+ Config: configs/efficientnet/efficientnet-b5_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b5.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b6_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 19971777560
+ Parameters: 43040704
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.05
+ Top 5 Accuracy: 96.82
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty_8xb32-aa_in1k_20220119-45b03310.pth
+ Config: configs/efficientnet/efficientnet-b6_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b6.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b6_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 19971777560
+ Parameters: 43040704
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.74
+ Top 5 Accuracy: 97.14
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty_8xb32-aa-advprop_in1k_20220119-bfe3485e.pth
+ Config: configs/efficientnet/efficientnet-b6_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b6.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b6_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 19971777560
+ Parameters: 43040704
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.47
+ Top 5 Accuracy: 97.87
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty-ra-noisystudent_in1k_20221103-7de7d2cc.pth
+ Config: configs/efficientnet/efficientnet-b6_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b6.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b7_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 39316473392
+ Parameters: 66347960
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.38
+ Top 5 Accuracy: 96.88
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty_8xb32-aa_in1k_20220119-bf03951c.pth
+ Config: configs/efficientnet/efficientnet-b7_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b7.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b7_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 39316473392
+ Parameters: 66347960
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.14
+ Top 5 Accuracy: 97.23
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty_8xb32-aa-advprop_in1k_20220119-c6dbff10.pth
+ Config: configs/efficientnet/efficientnet-b7_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b7.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b7_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 39316473392
+ Parameters: 66347960
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.83
+ Top 5 Accuracy: 98.08
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty-ra-noisystudent_in1k_20221103-a82894bc.pth
+ Config: configs/efficientnet/efficientnet-b7_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b7.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 64999827816
+ Parameters: 87413142
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.38
+ Top 5 Accuracy: 97.28
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k_20220119-297ce1b7.pth
+ Config: configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b8.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-l2_3rdparty-ra-noisystudent_in1k-800px
+ Metadata:
+      FLOPs: 484984099280
+ Parameters: 480309308
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 88.33
+ Top 5 Accuracy: 98.65
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-l2_3rdparty-ra-noisystudent_in1k_20221103-be73be13.pth
+ Config: configs/efficientnet/efficientnet-l2_8xb8_in1k-800px.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-l2.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-l2_3rdparty-ra-noisystudent_in1k-475px
+ Metadata:
+      FLOPs: 174203533416
+ Parameters: 480309308
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 88.18
+ Top 5 Accuracy: 98.55
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-l2_3rdparty-ra-noisystudent_in1k-475px_20221103-5a0d8058.pth
+ Config: configs/efficientnet/efficientnet-l2_8xb32_in1k-475px.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-l2_475.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
diff --git a/configs/efficientnet_v2/README.md b/configs/efficientnet_v2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..965421823e7fe3e6cf8504d717864bf8a499ab2e
--- /dev/null
+++ b/configs/efficientnet_v2/README.md
@@ -0,0 +1,98 @@
+# EfficientNetV2
+
+> [EfficientNetV2: Smaller Models and Faster Training](https://arxiv.org/abs/2104.00298)
+
+
+
+## Abstract
+
+This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models. To develop this family of models, we use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency. The models were searched from the search space enriched with new ops such as Fused-MBConv. Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller. Our training can be further sped up by progressively increasing the image size during training, but it often causes a drop in accuracy. To compensate for this accuracy drop, we propose to adaptively adjust regularization (e.g., dropout and data augmentation) as well, such that we can achieve both fast training and good accuracy. With progressive learning, our EfficientNetV2 significantly outperforms previous models on ImageNet and CIFAR/Cars/Flowers datasets. By pretraining on the same ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources. Code will be available at https://github.com/google/automl/tree/master/efficientnetv2.
+
+
+

+
+
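+The progressive learning recipe described in the abstract amounts to a staged schedule: early stages train on small images with weak regularization, and both the image size and the regularization strength (dropout, RandAugment magnitude) are increased together in later stages. Below is a minimal, framework-agnostic sketch of such a schedule; the number of stages and the value ranges are illustrative assumptions, not the exact settings used in the paper or in these configs.
+
+```python
+def progressive_schedule(stage, num_stages=4, image_size=(128, 300),
+                         dropout=(0.1, 0.3), randaug_magnitude=(5, 15)):
+    """Linearly interpolate image size and regularization across stages."""
+    t = stage / max(num_stages - 1, 1)  # training progress in [0, 1]
+
+    def lerp(lo, hi):
+        return lo + (hi - lo) * t
+
+    return dict(
+        image_size=int(round(lerp(*image_size))),
+        dropout=round(lerp(*dropout), 3),
+        randaug_magnitude=round(lerp(*randaug_magnitude), 1),
+    )
+
+
+# Stage 0: 128px images with light regularization; stage 3: 300px with the strongest settings.
+for stage in range(4):
+    print(stage, progressive_schedule(stage))
+```
+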
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('efficientnetv2-b0_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('efficientnetv2-b0_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b0_3rdparty_in1k_20221221-9ef6e736.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :----------------------------------- | :--------: | :-------: | :----------------------------------------: | :-----------------------------------------------------------------------------------------------------: |
+| `efficientnetv2-s_3rdparty_in21k`\* | 48.16 | 3.31 | [config](efficientnetv2-s_8xb32_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_3rdparty_in21k_20221220-c0572b56.pth) |
+| `efficientnetv2-m_3rdparty_in21k`\* | 80.84 | 5.86 | [config](efficientnetv2-m_8xb32_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_3rdparty_in21k_20221220-073e944c.pth) |
+| `efficientnetv2-l_3rdparty_in21k`\* | 145.22 | 13.11 | [config](efficientnetv2-l_8xb32_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_3rdparty_in21k_20221220-f28f91e1.pth) |
+| `efficientnetv2-xl_3rdparty_in21k`\* | 234.82 | 18.86 | [config](efficientnetv2-xl_8xb32_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-xl_3rdparty_in21k_20221220-b2c9329c.pth) |
+
+*Models with \* are converted from [timm](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------------: | :---------------------------------------------------------: |
+| `efficientnetv2-b0_3rdparty_in1k`\* | From scratch | 7.14 | 0.92 | 78.52 | 94.44 | [config](efficientnetv2-b0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b0_3rdparty_in1k_20221221-9ef6e736.pth) |
+| `efficientnetv2-b1_3rdparty_in1k`\* | From scratch | 8.14 | 1.44 | 79.80 | 94.89 | [config](efficientnetv2-b1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b1_3rdparty_in1k_20221221-6955d9ce.pth) |
+| `efficientnetv2-b2_3rdparty_in1k`\* | From scratch | 10.10 | 1.99 | 80.63 | 95.30 | [config](efficientnetv2-b2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b2_3rdparty_in1k_20221221-74f7d493.pth) |
+| `efficientnetv2-b3_3rdparty_in1k`\* | From scratch | 14.36 | 3.50 | 82.03 | 95.88 | [config](efficientnetv2-b3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b3_3rdparty_in1k_20221221-b6f07a36.pth) |
+| `efficientnetv2-s_3rdparty_in1k`\* | From scratch | 21.46 | 9.72 | 83.82 | 96.67 | [config](efficientnetv2-s_8xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_3rdparty_in1k_20221220-f0eaff9d.pth) |
+| `efficientnetv2-m_3rdparty_in1k`\* | From scratch | 54.14 | 26.88 | 85.01 | 97.26 | [config](efficientnetv2-m_8xb32_in1k-480px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_3rdparty_in1k_20221220-9dc0c729.pth) |
+| `efficientnetv2-l_3rdparty_in1k`\* | From scratch | 118.52 | 60.14 | 85.43 | 97.31 | [config](efficientnetv2-l_8xb32_in1k-480px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_3rdparty_in1k_20221220-5c3bac0f.pth) |
+| `efficientnetv2-s_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 21.46 | 9.72 | 84.29 | 97.26 | [config](efficientnetv2-s_8xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_in21k-pre-3rdparty_in1k_20221220-7a7c8475.pth) |
+| `efficientnetv2-m_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 54.14 | 26.88 | 85.47 | 97.76 | [config](efficientnetv2-m_8xb32_in1k-480px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_in21k-pre-3rdparty_in1k_20221220-a1013a04.pth) |
+| `efficientnetv2-l_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 118.52 | 60.14 | 86.31 | 97.99 | [config](efficientnetv2-l_8xb32_in1k-480px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_in21k-pre-3rdparty_in1k_20221220-63df0efd.pth) |
+| `efficientnetv2-xl_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 208.12 | 98.34 | 86.39 | 97.83 | [config](efficientnetv2-xl_8xb32_in1k-512px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-xl_in21k-pre-3rdparty_in1k_20221220-583ac18b.pth) |
+
+*Models with \* are converted from [timm](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py). The config files of these models are only for inference. We haven't reproduced the training results.*
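+
+The ImageNet-21k checkpoints in the first table are meant as pre-training weights rather than 1k classifiers. Below is a minimal sketch of loading one of them for feature extraction; it assumes the `efficientnetv2-s_3rdparty_in21k` entry is registered under exactly the name listed above.
+
+```python
+import torch
+from mmpretrain import get_model
+
+# Load the ImageNet-21k pre-trained EfficientNetV2-S from the table above.
+model = get_model('efficientnetv2-s_3rdparty_in21k', pretrained=True)
+model.eval()
+
+with torch.no_grad():
+    feats = model.extract_feat(torch.rand(1, 3, 224, 224))
+print([f.shape for f in feats])  # backbone features, e.g. a (1, 1280) vector
+```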
+
+## Citation
+
+```bibtex
+@inproceedings{tan2021efficientnetv2,
+ title={Efficientnetv2: Smaller models and faster training},
+ author={Tan, Mingxing and Le, Quoc},
+ booktitle={International Conference on Machine Learning},
+ pages={10096--10106},
+ year={2021},
+ organization={PMLR}
+}
+```
diff --git a/configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py b/configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4dc23d4904ef87f3ca581dc022a65f8d9c925038
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py
@@ -0,0 +1,58 @@
+_base_ = [
+ '../_base_/models/efficientnet_v2/efficientnetv2_b0.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=192,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=224, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-b1_8xb32_in1k.py b/configs/efficientnet_v2/efficientnetv2-b1_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fa187ff1503531732b10e2b178751361e4a4de2d
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-b1_8xb32_in1k.py
@@ -0,0 +1,21 @@
+_base_ = ['./efficientnetv2-b0_8xb32_in1k.py']
+
+# model setting
+model = dict(backbone=dict(arch='b1'), head=dict(in_channels=1280, ))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=192),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=240, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-b2_8xb32_in1k.py b/configs/efficientnet_v2/efficientnetv2-b2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3ff5530d1dbac739295c6fbc1f61fa6b36d8aa65
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-b2_8xb32_in1k.py
@@ -0,0 +1,21 @@
+_base_ = ['./efficientnetv2-b0_8xb32_in1k.py']
+
+# model setting
+model = dict(backbone=dict(arch='b2'), head=dict(in_channels=1408, ))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=208),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=260, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-b3_8xb32_in1k.py b/configs/efficientnet_v2/efficientnetv2-b3_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..84fb29a55400a44af414b909c49806381f9564b9
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-b3_8xb32_in1k.py
@@ -0,0 +1,21 @@
+_base_ = ['./efficientnetv2-b0_8xb32_in1k.py']
+
+# model setting
+model = dict(backbone=dict(arch='b3'), head=dict(in_channels=1536, ))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=240),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=300, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py b/configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py
new file mode 100644
index 0000000000000000000000000000000000000000..c3606cf07086f6a8f0580183e6f94d9e1950dae3
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ 'efficientnetv2-s_8xb32_in1k-384px.py',
+]
+
+# model setting
+model = dict(backbone=dict(arch='l'), )
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=384, crop_padding=0),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=480, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-l_8xb32_in21k.py b/configs/efficientnet_v2/efficientnetv2-l_8xb32_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..179c72075f6f5caa4fc551fee0e3462db6dcba18
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-l_8xb32_in21k.py
@@ -0,0 +1,4 @@
+_base_ = ['./efficientnetv2-s_8xb32_in21k.py']
+
+# model setting
+model = dict(backbone=dict(arch='l'), )
diff --git a/configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py b/configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py
new file mode 100644
index 0000000000000000000000000000000000000000..c7bdd9be3b8e45ccb512f86049df482306ad91d9
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ 'efficientnetv2-s_8xb32_in1k-384px.py',
+]
+
+# model setting
+model = dict(backbone=dict(arch='m'), )
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=384, crop_padding=0),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=480, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-m_8xb32_in21k.py b/configs/efficientnet_v2/efficientnetv2-m_8xb32_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f04d616376aa523526425c595904e64db0214ecc
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-m_8xb32_in21k.py
@@ -0,0 +1,4 @@
+_base_ = ['./efficientnetv2-s_8xb32_in21k.py']
+
+# model setting
+model = dict(backbone=dict(arch='m'), )
diff --git a/configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py b/configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..2bdee636a20bf50cff4126cd50087724b7a9072f
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/models/efficientnet_v2/efficientnetv2_s.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=300, crop_padding=0),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=384, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-s_8xb32_in21k.py b/configs/efficientnet_v2/efficientnetv2-s_8xb32_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..54f8a5af4eb92f8de1d7e5f488a8b222afda9239
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-s_8xb32_in21k.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/models/efficientnet_v2/efficientnetv2_s.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# model setting
+model = dict(head=dict(num_classes=21843))
+
+# dataset settings
+dataset_type = 'ImageNet21k'
+data_preprocessor = dict(
+ num_classes=21843,
+ # RGB format normalization parameters
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=224, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
diff --git a/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in1k-512px.py b/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in1k-512px.py
new file mode 100644
index 0000000000000000000000000000000000000000..18f56ff063b3dd1eee15f81718cd88cd83eeb9df
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in1k-512px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ 'efficientnetv2-s_8xb32_in1k-384px.py',
+]
+
+# model setting
+model = dict(backbone=dict(arch='xl'), )
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=384, crop_padding=0),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=512, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in21k.py b/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e2ee84cb32f7b83bf6d950a92088e983063ce049
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in21k.py
@@ -0,0 +1,4 @@
+_base_ = ['./efficientnetv2-s_8xb32_in21k.py']
+
+# model setting
+model = dict(backbone=dict(arch='xl'), )
diff --git a/configs/efficientnet_v2/metafile.yml b/configs/efficientnet_v2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..6c927dce99ad0bf9c6e5555c4e9496e2613960d3
--- /dev/null
+++ b/configs/efficientnet_v2/metafile.yml
@@ -0,0 +1,255 @@
+Collections:
+ - Name: EfficientNetV2
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - 1x1 Convolution
+ - Average Pooling
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - Inverted Residual Block
+ - RMSProp
+ - Squeeze-and-Excitation Block
+ - Swish
+ Paper:
+ URL: https://arxiv.org/abs/2104.00298
+ Title: "EfficientNetV2: Smaller Models and Faster Training"
+ README: configs/efficientnet_v2/README.md
+ Code:
+      URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/efficientnet_v2.py
+ Version: v1.0.0rc4
+
+Models:
+ - Name: efficientnetv2-b0_3rdparty_in1k
+ Metadata:
+ FLOPs: 919843360
+ Parameters: 7139704
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.52
+ Top 5 Accuracy: 94.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b0_3rdparty_in1k_20221221-9ef6e736.pth
+ Config: configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_b0-c7cc451f.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-b1_3rdparty_in1k
+ Metadata:
+ FLOPs: 1438287552
+ Parameters: 8141052
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.80
+ Top 5 Accuracy: 94.89
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b1_3rdparty_in1k_20221221-6955d9ce.pth
+ Config: configs/efficientnet_v2/efficientnetv2-b1_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_b1-be6e41b0.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-b2_3rdparty_in1k
+ Metadata:
+ FLOPs: 1986433080
+ Parameters: 10096086
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.63
+ Top 5 Accuracy: 95.30
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b2_3rdparty_in1k_20221221-74f7d493.pth
+ Config: configs/efficientnet_v2/efficientnetv2-b2_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_b2-847de54e.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-b3_3rdparty_in1k
+ Metadata:
+ FLOPs: 3498068400
+ Parameters: 14358406
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.03
+ Top 5 Accuracy: 95.88
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b3_3rdparty_in1k_20221221-b6f07a36.pth
+ Config: configs/efficientnet_v2/efficientnetv2-b3_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_b3-57773f13.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-s_3rdparty_in1k
+ Metadata:
+ FLOPs: 9719420928
+ Parameters: 21458488
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.82
+ Top 5 Accuracy: 96.67
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_3rdparty_in1k_20221220-f0eaff9d.pth
+ Config: configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_s-eb54923e.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-m_3rdparty_in1k
+ Metadata:
+ FLOPs: 26880363584
+ Parameters: 54139356
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.01
+ Top 5 Accuracy: 97.26
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_3rdparty_in1k_20221220-9dc0c729.pth
+ Config: configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_m-cc09e0cd.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-l_3rdparty_in1k
+ Metadata:
+ FLOPs: 60142387008
+ Parameters: 118515272
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.43
+ Top 5 Accuracy: 97.31
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_3rdparty_in1k_20221220-5c3bac0f.pth
+ Config: configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_l-d664b728.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-s_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 9719420928
+ Parameters: 21458488
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.29
+ Top 5 Accuracy: 97.26
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_in21k-pre-3rdparty_in1k_20221220-7a7c8475.pth
+ Config: configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_s_21ft1k-d7dafa41.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-m_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 26880363584
+ Parameters: 54139356
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.47
+ Top 5 Accuracy: 97.76
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_in21k-pre-3rdparty_in1k_20221220-a1013a04.pth
+ Config: configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_m_21ft1k-bf41664a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-l_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 60142387008
+ Parameters: 118515272
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.31
+ Top 5 Accuracy: 97.99
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_in21k-pre-3rdparty_in1k_20221220-63df0efd.pth
+ Config: configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_l_21ft1k-60127a9d.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-xl_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 98341230592
+ Parameters: 208119808
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.39
+ Top 5 Accuracy: 97.83
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-xl_in21k-pre-3rdparty_in1k_20221220-583ac18b.pth
+ Config: configs/efficientnet_v2/efficientnetv2-xl_8xb32_in1k-512px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_xl_in21ft1k-06c35c48.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-s_3rdparty_in21k
+ Metadata:
+ FLOPs: 3309720768
+ Parameters: 48158371
+ In Collection: EfficientNetV2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_3rdparty_in21k_20221220-c0572b56.pth
+ Config: configs/efficientnet_v2/efficientnetv2-s_8xb32_in21k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_s_21k-6337ad01.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-m_3rdparty_in21k
+ Metadata:
+ FLOPs: 5861638208
+ Parameters: 80839239
+ In Collection: EfficientNetV2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_3rdparty_in21k_20221220-073e944c.pth
+ Config: configs/efficientnet_v2/efficientnetv2-m_8xb32_in21k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_m_21k-361418a2.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-l_3rdparty_in21k
+ Metadata:
+ FLOPs: 13114950464
+ Parameters: 145215155
+ In Collection: EfficientNetV2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_3rdparty_in21k_20221220-f28f91e1.pth
+ Config: configs/efficientnet_v2/efficientnetv2-l_8xb32_in21k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_l_21k-91a19ec9.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-xl_3rdparty_in21k
+ Metadata:
+ FLOPs: 18855244288
+ Parameters: 234819691
+ In Collection: EfficientNetV2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-xl_3rdparty_in21k_20221220-b2c9329c.pth
+ Config: configs/efficientnet_v2/efficientnetv2-xl_8xb32_in21k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_xl_in21k-fd7e8abf.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
diff --git a/configs/eva/README.md b/configs/eva/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6e49c8abe8e88bc8eb683dd6dcc0ff06faf86f5f
--- /dev/null
+++ b/configs/eva/README.md
@@ -0,0 +1,101 @@
+# EVA
+
+> [EVA: Exploring the Limits of Masked Visual Representation Learning at Scale](https://arxiv.org/abs/2211.07636)
+
+
+
+## Abstract
+
+We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on LVISv1.0 dataset with over a thousand categories and COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
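+
+The `extract_feat` call above typically returns a tuple with one tensor per output stage. A minimal sketch for inspecting the feature shapes (our illustration, assuming the tuple-of-tensors convention used by most MMPreTrain backbones):
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k', pretrained=True)
+model.eval()
+
+with torch.no_grad():
+    feats = model.extract_feat(torch.rand(1, 3, 224, 224))
+
+# Print the shape of each returned feature (fall back to the type if an
+# element is not a tensor).
+for i, feat in enumerate(feats):
+    print(i, getattr(feat, 'shape', type(feat)))
+```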
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.pth
+```
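+
+All configs in this folder are assembled from the `_base_` files they inherit. The sketch below (ours, using MMEngine's `Config` API, which MMPreTrain depends on) shows how to load a config with the bases resolved and tweak a field before launching a job; the printed values come from the fine-tuning config used above:
+
+```python
+from mmengine.config import Config
+
+# `_base_` files are resolved automatically when loading the config.
+cfg = Config.fromfile('configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py')
+
+print(cfg.model.backbone.arch)          # base
+print(cfg.train_dataloader.batch_size)  # 128
+print(cfg.optim_wrapper.optimizer.lr)   # 0.0004
+
+# Override a field in memory, e.g. to debug with a smaller batch size.
+cfg.train_dataloader.batch_size = 64
+```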
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------------------------- | :--------: | :-------: | :-------------------------------------------------------------: | :----------------------------------------------------------------: |
+| `eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k` | 111.78 | 17.58 | [config](eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.json) |
+| `beit-l-p14_3rdparty-eva_in21k`\* | 303.18 | 81.08 | [config](eva-l-p14_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth) |
+| `beit-l-p14_eva-pre_3rdparty_in21k`\* | 303.18 | 81.08 | [config](eva-l-p14_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in21k_20221213-8f194fa2.pth) |
+| `beit-g-p16_3rdparty-eva_30m`\* | 1011.32 | 203.52 | [config](eva-g-p16_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p16_3rdparty_30m_20221213-7bed23ee.pth) |
+| `beit-g-p14_3rdparty-eva_30m`\* | 1011.60 | 267.17 | [config](eva-g-p14_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_3rdparty_30m_20221213-3b7aca97.pth) |
+| `beit-g-p14_eva-30m-pre_3rdparty_in21k`\* | 1011.60 | 267.17 | [config](eva-g-p14_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/baaivision/EVA). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: |
+| `vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k` | [EVA MAE STYLE](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth) | 86.57 | 17.58 | 83.70 | N/A | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.json) |
+| `vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k` | [EVA MAE STYLE](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth) | 86.57 | 17.58 | 69.00 | N/A | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221226-ef51bf09.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221226-ef51bf09.json) |
+| `beit-l-p14_eva-pre_3rdparty_in1k-196px`\* | [EVA](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth) | 304.14 | 61.57 | 87.94 | 98.5 | [config](eva-l-p14_8xb16_in1k-196px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-196px_20221214-2adf4d28.pth) |
+| `beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px`\* | EVA ImageNet-21k | 304.14 | 61.57 | 88.58 | 98.65 | [config](eva-l-p14_8xb16_in1k-196px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-196px_20221213-b730c7e7.pth) |
+| `beit-l-p14_eva-pre_3rdparty_in1k-336px`\* | [EVA](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth) | 304.53 | 191.10 | 88.66 | 98.75 | [config](eva-l-p14_8xb16_in1k-336px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-336px_20221214-07785cfd.pth) |
+| `beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px`\* | EVA ImageNet-21k | 304.53 | 191.10 | 89.17 | 98.86 | [config](eva-l-p14_8xb16_in1k-336px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-336px_20221213-f25b7634.pth) |
+| `beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px`\* | [EVA 30M ImageNet-21k](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth) | 1013.01 | 620.64 | 89.61 | 98.93 | [config](eva-g-p14_8xb16_in1k-336px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-336px_20221213-210f9071.pth) |
+| `beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px`\* | [EVA 30M ImageNet-21k](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth) | 1014.45 | 1906.76 | 89.71 | 98.96 | [config](eva-g-p14_8xb16_in1k-560px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-560px_20221213-fa1c3652.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/baaivision/EVA). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{EVA,
+ title={EVA: Exploring the Limits of Masked Visual Representation Learning at Scale},
+ author={Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
+ journal={arXiv preprint arXiv:2211.07636},
+ year={2022}
+}
+```
diff --git a/configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py b/configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e8a3f4983ac19208090ee63e9c9160b945b22ee6
--- /dev/null
+++ b/configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,114 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ final_norm=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=4e-4, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.65,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=100)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py b/configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0b7333ca475ad1d9607ddda898acb623e1bd7aa4
--- /dev/null
+++ b/configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
@@ -0,0 +1,70 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+train_dataloader = dict(batch_size=2048, drop_last=True)
+val_dataloader = dict(drop_last=False)
+test_dataloader = dict(drop_last=False)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ frozen_stages=12,
+ out_type='cls_token',
+ final_norm=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=dict(type='ClsBatchNormNeck', input_features=768),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]),
+ data_preprocessor=dict(
+ num_classes=1000,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True,
+ ))
+
+# optimizer
+optim_wrapper = dict(
+ _delete_=True,
+ type='AmpOptimWrapper',
+ optimizer=dict(type='LARS', lr=3.2, weight_decay=0.0, momentum=0.9),
+)
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=90,
+ by_epoch=True,
+ begin=10,
+ end=100,
+ eta_min=0.0,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=10))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/eva/eva-g-p14_8xb16_in1k-336px.py b/configs/eva/eva-g-p14_8xb16_in1k-336px.py
new file mode 100644
index 0000000000000000000000000000000000000000..aa2bd7ee5be0167c5d69d5f1cc96a069e5f17cb5
--- /dev/null
+++ b/configs/eva/eva-g-p14_8xb16_in1k-336px.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/eva/eva-g.py',
+ '../_base_/datasets/imagenet_bs16_eva_336.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(backbone=dict(img_size=336))
diff --git a/configs/eva/eva-g-p14_8xb16_in1k-560px.py b/configs/eva/eva-g-p14_8xb16_in1k-560px.py
new file mode 100644
index 0000000000000000000000000000000000000000..ed20866b7f0dc19b919a06a71e50a205370194a0
--- /dev/null
+++ b/configs/eva/eva-g-p14_8xb16_in1k-560px.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/eva/eva-g.py',
+ '../_base_/datasets/imagenet_bs16_eva_560.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(backbone=dict(img_size=560))
diff --git a/configs/eva/eva-g-p14_headless.py b/configs/eva/eva-g-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..b278aceab6211c55702c69beb1b396f37064a8b9
--- /dev/null
+++ b/configs/eva/eva-g-p14_headless.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='eva-g',
+ img_size=224,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ out_type='avg_featmap',
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ ),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva/eva-g-p16_headless.py b/configs/eva/eva-g-p16_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..ca5de1860f5edb0ee768eb12ce7c528fa17e2a00
--- /dev/null
+++ b/configs/eva/eva-g-p16_headless.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='eva-g',
+ img_size=224,
+ patch_size=16,
+ layer_scale_init_value=0.0,
+ out_type='avg_featmap',
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ ),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva/eva-l-p14_8xb16_in1k-196px.py b/configs/eva/eva-l-p14_8xb16_in1k-196px.py
new file mode 100644
index 0000000000000000000000000000000000000000..3503ca5d78022e29f1c1c945aa1226085f1c3eb6
--- /dev/null
+++ b/configs/eva/eva-l-p14_8xb16_in1k-196px.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/eva/eva-l.py',
+ '../_base_/datasets/imagenet_bs16_eva_196.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(backbone=dict(img_size=196))
diff --git a/configs/eva/eva-l-p14_8xb16_in1k-336px.py b/configs/eva/eva-l-p14_8xb16_in1k-336px.py
new file mode 100644
index 0000000000000000000000000000000000000000..7094df8ba3de0540049eaeb4693ef5b09094dc2b
--- /dev/null
+++ b/configs/eva/eva-l-p14_8xb16_in1k-336px.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/eva/eva-l.py',
+ '../_base_/datasets/imagenet_bs16_eva_336.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(backbone=dict(img_size=336))
diff --git a/configs/eva/eva-l-p14_headless.py b/configs/eva/eva-l-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..89a4ce10990489daf92e95c1355669f242838ff3
--- /dev/null
+++ b/configs/eva/eva-l-p14_headless.py
@@ -0,0 +1,25 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='l',
+ img_size=224,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ out_type='avg_featmap',
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ layer_cfgs=dict(bias=True),
+ ),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py b/configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..bbedb07c727aaa38c2de9f57fa6cfe9fdbdd87a2
--- /dev/null
+++ b/configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py
@@ -0,0 +1,86 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='EVA',
+ backbone=dict(init_cfg=[
+ dict(type='Xavier', distribution='uniform', layer='Linear'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ neck=dict(
+ type='MAEPretrainDecoder',
+ predict_feature_dim=512,
+ init_cfg=[
+ dict(type='Xavier', distribution='uniform', layer='Linear'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ head=dict(
+ _delete_=True,
+ type='MIMHead',
+ loss=dict(
+ type='CosineSimilarityLoss', shift_factor=2.0, scale_factor=2.0),
+ ),
+ target_generator=dict(
+ type='CLIPGenerator',
+ tokenizer_path= # noqa
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa
+ ),
+ init_cfg=None)
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+find_unused_parameters = True
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
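+# When auto scaling is enabled, the learning rate defined above is multiplied
+# by (real total batch size) / base_batch_size, so other GPU or per-GPU
+# batch-size settings keep roughly the same effective schedule.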
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/eva/metafile.yml b/configs/eva/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..dd8dbbf761486532d228bbf3df5ef396b92d4880
--- /dev/null
+++ b/configs/eva/metafile.yml
@@ -0,0 +1,261 @@
+Collections:
+ - Name: EVA
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ Title: 'EVA: Exploring the Limits of Masked Visual Representation Learning at
+ Scale'
+ URL: https://arxiv.org/abs/2211.07636
+ README: configs/eva/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k
+ Metadata:
+ Epochs: 400
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 111776512
+ Training Data: ImageNet-1k
+ In Collection: EVA
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth
+ Config: configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py
+ Downstream:
+ - vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k
+ - vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k
+ - Name: vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.7
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.pth
+ Config: configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 16384
+ FLOPs: 17581972992
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.0
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221226-ef51bf09.pth
+ Config: configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
+ - Name: beit-l-p14_eva-pre_3rdparty_in1k-196px
+ Metadata:
+ FLOPs: 61565981696
+ Parameters: 304142312
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 87.94
+ Top 5 Accuracy: 98.5
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-196px_20221214-2adf4d28.pth
+ Config: configs/eva/eva-l-p14_8xb16_in1k-196px.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_196px_1k_ft_88p0.pt
+ Code: https://github.com/baaivision/EVA
+ - Name: beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px
+ Metadata:
+ FLOPs: 61565981696
+ Parameters: 304142312
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 88.58
+ Top 5 Accuracy: 98.65
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-196px_20221213-b730c7e7.pth
+ Config: configs/eva/eva-l-p14_8xb16_in1k-196px.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_196px_21k_to_1k_ft_88p6.pt
+ Code: https://github.com/baaivision/EVA
+ - Name: beit-l-p14_3rdparty-eva_in21k
+ Metadata:
+ FLOPs: 81075147776
+ Parameters: 303178752
+ Training Data:
+ - ImageNet-21k
+ In Collection: EVA
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth
+ Config: configs/eva/eva-l-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14.pt
+ Code: https://github.com/baaivision/EVA
+ Downstream:
+ - beit-l-p14_eva-pre_3rdparty_in21k
+ - beit-l-p14_eva-pre_3rdparty_in1k-336px
+ - beit-l-p14_eva-pre_3rdparty_in1k-196px
+ - Name: beit-l-p14_eva-pre_3rdparty_in21k
+ Metadata:
+ FLOPs: 81075147776
+ Parameters: 303178752
+ Training Data:
+ - ImageNet-21k
+ In Collection: EVA
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in21k_20221213-8f194fa2.pth
+ Config: configs/eva/eva-l-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_21k_ft.pt
+ Code: https://github.com/baaivision/EVA
+ - Name: beit-l-p14_eva-pre_3rdparty_in1k-336px
+ Metadata:
+ FLOPs: 191100916736
+ Parameters: 304531432
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 88.66
+ Top 5 Accuracy: 98.75
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-336px_20221214-07785cfd.pth
+ Config: configs/eva/eva-l-p14_8xb16_in1k-336px.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_336px_1k_ft_88p65.pt
+ Code: https://github.com/baaivision/EVA
+ Downstream:
+ - beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px
+ - beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px
+ - Name: beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px
+ Metadata:
+ FLOPs: 191100916736
+ Parameters: 304531432
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 89.17
+ Top 5 Accuracy: 98.86
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-336px_20221213-f25b7634.pth
+ Config: configs/eva/eva-l-p14_8xb16_in1k-336px.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_336px_21k_to_1k_ft_89p2.pt
+ Code: https://github.com/baaivision/EVA
+ - Name: beit-g-p16_3rdparty-eva_30m
+ Metadata:
+ FLOPs: 203517463424
+ Parameters: 1011315072
+ Training Data:
+ - merged-30M
+ In Collection: EVA
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p16_3rdparty_30m_20221213-7bed23ee.pth
+ Config: configs/eva/eva-g-p16_headless.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_psz14to16.pt
+ Code: https://github.com/baaivision/EVA
+ - Name: beit-g-p14_3rdparty-eva_30m
+ Metadata:
+ FLOPs: 267174833024
+ Parameters: 1011596672
+ Training Data:
+ - merged-30M
+ In Collection: EVA
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_3rdparty_30m_20221213-3b7aca97.pth
+ Config: configs/eva/eva-g-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_psz14.pt
+ Code: https://github.com/baaivision/EVA
+ Downstream:
+ - beit-g-p14_eva-30m-pre_3rdparty_in21k
+ - Name: beit-g-p14_eva-30m-pre_3rdparty_in21k
+ Metadata:
+ FLOPs: 267174833024
+ Parameters: 1011596672
+ Training Data:
+ - merged-30M
+ - ImageNet-21k
+ In Collection: EVA
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth
+ Config: configs/eva/eva-g-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_224px_psz14.pt
+ Code: https://github.com/baaivision/EVA
+ Downstream:
+ - beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px
+ - beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px
+ - Name: beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px
+ Metadata:
+ FLOPs: 620642757504
+ Parameters: 1013005672
+ Training Data:
+ - merged-30M
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 89.61
+ Top 5 Accuracy: 98.93
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-336px_20221213-210f9071.pth
+ Config: configs/eva/eva-g-p14_8xb16_in1k-336px.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_336px_psz14_ema_89p6.pt
+ Code: https://github.com/baaivision/EVA
+ - Name: beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px
+ Metadata:
+ FLOPs: 1906761591680
+ Parameters: 1014447464
+ Training Data:
+ - merged-30M
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 89.71
+ Top 5 Accuracy: 98.96
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-560px_20221213-fa1c3652.pth
+ Config: configs/eva/eva-g-p14_8xb16_in1k-560px.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_560px_psz14_ema_89p7.pt
+ Code: https://github.com/baaivision/EVA
diff --git a/configs/eva02/README.md b/configs/eva02/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..bc8f64e76d1601ade6ef052a2f23f7d2f6123843
--- /dev/null
+++ b/configs/eva02/README.md
@@ -0,0 +1,109 @@
+# EVA-02
+
+> [EVA-02: A Visual Representation for Neon Genesis](https://arxiv.org/abs/2303.11331)
+
+
+
+## Abstract
+
+We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 parameters and ~1/6 image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', pretrained=True)
+inputs = torch.rand(1, 3, 336, 336)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/eva02/eva02-tiny-p14_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/eva02/eva02-tiny-p14_in1k.py /path/to/eva02-tiny-p14_in1k.pth
+```
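+
+For batched prediction over many images, the `ImageClassificationInferencer` helper can be used instead of `inference_model`. A minimal sketch (ours; it assumes the inferencer API described in the MMPreTrain inference user guide, and the image paths are placeholders):
+
+```python
+from mmpretrain import ImageClassificationInferencer
+
+# Build an inferencer from a model name in the tables below; the weights are
+# downloaded automatically when `pretrained=True`.
+inferencer = ImageClassificationInferencer(
+    'vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', pretrained=True)
+
+# Replace these placeholder paths with your own images.
+results = inferencer(['path/to/img1.jpg', 'path/to/img2.jpg'], batch_size=2)
+for res in results:
+    print(res['pred_class'], res['pred_score'])
+```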
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :-------------------------------- | :--------: | :-------: | :-----------------------------------: | :-----------------------------------------------------------------------------------------------------------: |
+| `vit-tiny-p14_eva02-pre_in21k`\* | 5.50 | 1.70 | [config](eva02-tiny-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_pre_in21k_20230505-d703e7b1.pth) |
+| `vit-small-p14_eva02-pre_in21k`\* | 21.62 | 6.14 | [config](eva02-small-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_pre_in21k_20230505-3175f463.pth) |
+| `vit-base-p14_eva02-pre_in21k`\* | 85.77 | 23.22 | [config](eva02-base-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_pre_in21k_20230505-2f2d4d3c.pth) |
+| `vit-large-p14_eva02-pre_in21k`\* | 303.29 | 81.15 | [config](eva02-large-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_in21k_20230505-9072de5d.pth) |
+| `vit-large-p14_eva02-pre_m38m`\* | 303.29 | 81.15 | [config](eva02-large-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_m38m_20230505-b8a1a261.pth) |
+
+- The input size / patch size of MIM pre-trained EVA-02 is `224x224` / `14x14`.
+
+*Models with * are converted from the [official repo](https://github.com/baaivision/EVA).*
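+
+As noted above, MIM pre-training uses `224x224` inputs with `14x14` patches, while the fine-tuned checkpoints below use 336px or 448px inputs. A quick back-of-the-envelope check (ours) of how the patch-token count, and hence the compute, grows with resolution:
+
+```python
+def num_patch_tokens(img_size: int, patch_size: int = 14) -> int:
+    """Number of patch tokens for a square input, excluding extra tokens."""
+    assert img_size % patch_size == 0
+    return (img_size // patch_size) ** 2
+
+print(num_patch_tokens(224))  # 256  (MIM pre-training)
+print(num_patch_tokens(336))  # 576  (tiny/small fine-tuned models)
+print(num_patch_tokens(448))  # 1024 (base/large fine-tuned models)
+```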
+
+### Image Classification on ImageNet-1k
+
+#### (*w/o* IN-21K intermediate fine-tuning)
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------: |
+| `vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k | 5.76 | 4.68 | 80.69 | 95.54 | [config](./eva02-tiny-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_in21k-pre_3rdparty_in1k-336px_20230505-a4e8708a.pth) |
+| `vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k | 22.13 | 15.48 | 85.78 | 97.60 | [config](./eva02-small-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_in21k-pre_3rdparty_in1k-336px_20230505-9c5b0e85.pth) |
+| `vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.29 | 98.53 | [config](./eva02-base-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_3rdparty_in1k-448px_20230505-8ad211c5.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/baaivision/EVA/tree/master/EVA-02). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+#### (*w* IN-21K intermediate fine-tuning)
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------: |
+| `vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.47 | 98.62 | [config](./eva02-base-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-5cd4d87f.pth) |
+| `vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 305.08 | 362.33 | 89.65 | 98.95 | [config](./eva02-large-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-926d1599.pth) |
+| `vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 Merged-38M | 305.10 | 362.33 | 89.83 | 99.00 | [config](./eva02-large-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_m38m-pre_in21k-medft_3rdparty_in1k-448px_20230505-150dc5ed.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/baaivision/EVA/tree/master/EVA-02). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{EVA-02,
+ title={EVA-02: A Visual Representation for Neon Genesis},
+ author={Yuxin Fang and Quan Sun and Xinggang Wang and Tiejun Huang and Xinlong Wang and Yue Cao},
+ journal={arXiv preprint arXiv:2303.11331},
+ year={2023}
+}
+```
diff --git a/configs/eva02/eva02-base-p14_headless.py b/configs/eva02/eva02-base-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..27aa8f8a502810d39865ee85fd45b5152c8d5269
--- /dev/null
+++ b/configs/eva02/eva02-base-p14_headless.py
@@ -0,0 +1,21 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='b',
+ img_size=224,
+ patch_size=14,
+ sub_ln=True,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva02/eva02-base-p14_in1k.py b/configs/eva02/eva02-base-p14_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c8400d38542d71ee5d3f9713e34236bdc0e7783a
--- /dev/null
+++ b/configs/eva02/eva02-base-p14_in1k.py
@@ -0,0 +1,32 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs16_eva_448.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='b',
+ img_size=448,
+ patch_size=14,
+ sub_ln=True,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/eva02/eva02-large-p14_headless.py b/configs/eva02/eva02-large-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..e101ac977c8590572190350292325c78477dbfd3
--- /dev/null
+++ b/configs/eva02/eva02-large-p14_headless.py
@@ -0,0 +1,21 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='l',
+ img_size=224,
+ patch_size=14,
+ sub_ln=True,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva02/eva02-large-p14_in1k.py b/configs/eva02/eva02-large-p14_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..91a42776dafd0f78ba6f3c1fbe68bfc602ad502e
--- /dev/null
+++ b/configs/eva02/eva02-large-p14_in1k.py
@@ -0,0 +1,32 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs16_eva_448.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='l',
+ img_size=448,
+ patch_size=14,
+ sub_ln=True,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/eva02/eva02-small-p14_headless.py b/configs/eva02/eva02-small-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..a969819308e9cea449b06ae3533839d72a2b96fe
--- /dev/null
+++ b/configs/eva02/eva02-small-p14_headless.py
@@ -0,0 +1,20 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='s',
+ img_size=224,
+ patch_size=14,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva02/eva02-small-p14_in1k.py b/configs/eva02/eva02-small-p14_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4a16d92456e39bb1147423682333cd24673133e6
--- /dev/null
+++ b/configs/eva02/eva02-small-p14_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs16_eva_336.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='s',
+ img_size=336,
+ patch_size=14,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/eva02/eva02-tiny-p14_headless.py b/configs/eva02/eva02-tiny-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..783d0ea2ebf35df3af8072958322f4f572e36210
--- /dev/null
+++ b/configs/eva02/eva02-tiny-p14_headless.py
@@ -0,0 +1,20 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='t',
+ img_size=224,
+ patch_size=14,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva02/eva02-tiny-p14_in1k.py b/configs/eva02/eva02-tiny-p14_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..84e68d7edd92d91689aa501397a9dbe3eba0b8b3
--- /dev/null
+++ b/configs/eva02/eva02-tiny-p14_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs16_eva_336.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='t',
+ img_size=336,
+ patch_size=14,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/eva02/metafile.yml b/configs/eva02/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..80acf904fb46e95f0ae52b1ff6fe3cf620cc8ae7
--- /dev/null
+++ b/configs/eva02/metafile.yml
@@ -0,0 +1,199 @@
+Collections:
+ - Name: EVA02
+ Metadata:
+ Architecture:
+ - Rotary Position Embedding
+ - Sub Layer Normalization
+ - SwiGLU
+ Paper:
+ Title: 'EVA-02: A Visual Representation for Neon Genesis'
+ URL: https://arxiv.org/abs/2303.11331
+ README: configs/eva02/README.md
+
+Models:
+ - Name: vit-tiny-p14_eva02-pre_in21k
+ Metadata:
+ FLOPs: 1703439360
+ Parameters: 5504064
+ Training Data:
+ - ImageNet-21k
+ In Collection: EVA02
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_pre_in21k_20230505-d703e7b1.pth
+ Config: configs/eva02/eva02-tiny-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_Ti_pt_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ Downstream:
+ - vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px
+ - Name: vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px
+ Metadata:
+ FLOPs: 4675416000
+ Parameters: 5758888
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA02
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 80.69
+ Top 5 Accuracy: 95.54
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_in21k-pre_3rdparty_in1k-336px_20230505-a4e8708a.pth
+ Config: configs/eva02/eva02-tiny-p14_in1k.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_Ti_pt_in21k_ft_in1k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ - Name: vit-small-p14_eva02-pre_in21k
+ Metadata:
+ FLOPs: 6135404544
+ Parameters: 21624960
+ Training Data:
+ - ImageNet-21k
+ In Collection: EVA02
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_pre_in21k_20230505-3175f463.pth
+ Config: configs/eva02/eva02-small-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_S_pt_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ Downstream:
+ - vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px
+ - Name: vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px
+ Metadata:
+ FLOPs: 15476744064
+ Parameters: 22133608
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA02
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.78
+ Top 5 Accuracy: 97.60
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_in21k-pre_3rdparty_in1k-336px_20230505-9c5b0e85.pth
+ Config: configs/eva02/eva02-small-p14_in1k.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_S_pt_in21k_ft_in1k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ - Name: vit-base-p14_eva02-pre_in21k
+ Metadata:
+ FLOPs: 23216492544
+ Parameters: 85766400
+ Training Data:
+ - ImageNet-21k
+ In Collection: EVA02
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_pre_in21k_20230505-2f2d4d3c.pth
+ Config: configs/eva02/eva02-base-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_B_pt_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ Downstream:
+ - vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px
+ - vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
+ - Name: vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px
+ Metadata:
+ FLOPs: 107105984256
+ Parameters: 87126760
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA02
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 88.29
+ Top 5 Accuracy: 98.53
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_3rdparty_in1k-448px_20230505-8ad211c5.pth
+ Config: configs/eva02/eva02-base-p14_in1k.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_B_pt_in21k_ft_in1k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ - Name: vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
+ Metadata:
+ FLOPs: 107105984256
+ Parameters: 87126760
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA02
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 88.47
+ Top 5 Accuracy: 98.62
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-5cd4d87f.pth
+ Config: configs/eva02/eva02-base-p14_in1k.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_B_pt_in21k_medft_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ - Name: vit-large-p14_eva02-pre_in21k
+ Metadata:
+ FLOPs: 81146703792
+ Parameters: 303291328
+ Training Data:
+ - ImageNet-21k
+ In Collection: EVA02
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_in21k_20230505-9072de5d.pth
+ Config: configs/eva02/eva02-large-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_L_pt_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ Downstream:
+ - vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
+ - Name: vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
+ Metadata:
+ FLOPs: 362333836208
+ Parameters: 305104808
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA02
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 89.65
+ Top 5 Accuracy: 98.95
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-926d1599.pth
+ Config: configs/eva02/eva02-large-p14_in1k.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_L_pt_in21k_medft_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ - Name: vit-large-p14_eva02-pre_m38m
+ Metadata:
+ FLOPs: 81146703792
+ Parameters: 303291328
+ Training Data:
+ - Merged-38M
+ In Collection: EVA02
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_m38m_20230505-b8a1a261.pth
+ Config: configs/eva02/eva02-large-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_L_pt_m38m_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ Downstream:
+ - vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px
+ - Name: vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px
+ Metadata:
+ FLOPs: 362333836208
+ Parameters: 305104808
+ Training Data:
+ - Merged-38M
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA02
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 89.83
+ Top 5 Accuracy: 99.00
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_m38m-pre_in21k-medft_3rdparty_in1k-448px_20230505-150dc5ed.pth
+ Config: configs/eva02/eva02-large-p14_in1k.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_L_pt_m38m_medft_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
diff --git a/configs/flamingo/README.md b/configs/flamingo/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..60c6af0f50e43cb0f84d2a3dbd2d343a435c6310
--- /dev/null
+++ b/configs/flamingo/README.md
@@ -0,0 +1,82 @@
+# Flamingo
+
+> [Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198)
+
+
+
+## Abstract
+
+Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+from mmpretrain import inference_model
+
+result = inference_model('flamingo_3rdparty-zeroshot_caption', 'demo/cat-dog.png')
+print(result)
+# {'pred_caption': 'A dog and a cat are looking at each other. '}
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/flamingo/flamingo_zeroshot_caption.py https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth
+```
+
+
+
+## Models and results
+
+### Image Caption on COCO
+
+| Model | Params (G) | CIDER | Config | Download |
+| :------------------------------------- | :--------: | :---: | :------------------------------------: | :-----------------------------------------------------------------------------------------------------------: |
+| `flamingo_3rdparty-zeroshot_caption`\* | 8.220 | 65.50 | [config](flamingo_zeroshot_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth) |
+
+*Models with * are converted from the [OpenFlamingo repository](https://github.com/mlfoundations/open_flamingo). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Visual Question Answering on VQAv2
+
+| Model | Params (G) | Accuracy | Config | Download |
+| :--------------------------------- | :--------: | :------: | :--------------------------------: | :----------------------------------------------------------------------------------------------------------------: |
+| `flamingo_3rdparty-zeroshot_vqa`\* | 8.22 | 43.50 | [config](flamingo_zeroshot_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth) |
+
+*Models with * are converted from the [OpenFlamingo repository](https://github.com/mlfoundations/open_flamingo). The config files of these models are only for inference. We haven't reproduced the training results.*
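+
+The zero-shot VQA model above can be called through the same `inference_model` helper as the caption example, passing a question together with the image. The snippet below is only a minimal sketch: the image path and question are placeholders, and the printed result is not a verified output.
+
+```python
+from mmpretrain import inference_model
+
+# Zero-shot VQA with the converted OpenFlamingo weights.
+# The image and question here are illustrative placeholders.
+result = inference_model(
+    'flamingo_3rdparty-zeroshot_vqa',
+    'demo/cat-dog.png',
+    'What animals are in the picture?')
+print(result)  # expected to be a dict containing a 'pred_answer' entry
+```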
+
+## Citation
+
+```bibtex
+@article{Alayrac2022FlamingoAV,
+ title={Flamingo: a Visual Language Model for Few-Shot Learning},
+ author={Jean-Baptiste Alayrac and Jeff Donahue and Pauline Luc and Antoine Miech and Iain Barr and Yana Hasson and Karel Lenc and Arthur Mensch and Katie Millican and Malcolm Reynolds and Roman Ring and Eliza Rutherford and Serkan Cabi and Tengda Han and Zhitao Gong and Sina Samangooei and Marianne Monteiro and Jacob Menick and Sebastian Borgeaud and Andy Brock and Aida Nematzadeh and Sahand Sharifzadeh and Mikolaj Binkowski and Ricardo Barreira and Oriol Vinyals and Andrew Zisserman and Karen Simonyan},
+ journal={ArXiv},
+ year={2022},
+ volume={abs/2204.14198}
+}
+```
+
+```bibtex
+@software{anas_awadalla_2023_7733589,
+ author = {Awadalla, Anas and Gao, Irena and Gardner, Joshua and Hessel, Jack and Hanafy, Yusuf and Zhu, Wanrong and Marathe, Kalyani and Bitton, Yonatan and Gadre, Samir and Jitsev, Jenia and Kornblith, Simon and Koh, Pang Wei and Ilharco, Gabriel and Wortsman, Mitchell and Schmidt, Ludwig},
+ title = {OpenFlamingo},
+ month = mar,
+ year = 2023,
+ publisher = {Zenodo},
+ version = {v0.1.1},
+ doi = {10.5281/zenodo.7733589},
+ url = {https://doi.org/10.5281/zenodo.7733589}
+}
+```
diff --git a/configs/flamingo/flamingo_fewshot_caption.py b/configs/flamingo/flamingo_fewshot_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..d6f9c2bfccdfb9617a14fae454af9bf209f3199a
--- /dev/null
+++ b/configs/flamingo/flamingo_fewshot_caption.py
@@ -0,0 +1,95 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Flamingo',
+ tokenizer=dict(
+ type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ lang_encoder=dict(
+ base=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='decapoda-research/llama-7b-hf',
+ local_files_only=True),
+ adapter=dict(
+ type='FlamingoLMAdapter',
+ vis_hidden_size=1024,
+ cross_attn_every_n_layers=4,
+ use_media_placement_augmentation=False),
+ ),
+ task='caption',
+ shot_prompt_tmpl='Output:{caption}<|endofchunk|>',
+ final_prompt_tmpl='Output:',
+ generation_cfg=dict(num_beams=3, max_new_tokens=20, length_penalty=-2.0))
+
+# data settings
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(
+ type='ApplyToList',
+ # Flamingo requires loading multiple images during few-shot inference.
+ scatter_key='img_path',
+ transforms=[
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=(224, 224)),
+ ],
+ collate_keys=['img', 'scale_factor', 'ori_shape'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['gt_caption', 'shots'],
+ meta_keys=['image_id']),
+]
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/captions_train2014.json',
+ data_prefix=dict(img_path='train2014'),
+ pipeline=test_pipeline,
+ num_shots=2,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/captions_train2014.json')
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
+
+# schedule settings
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/flamingo/flamingo_fewshot_vqa.py b/configs/flamingo/flamingo_fewshot_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..b85a6989b75b4cd1d7bf585cb83b40add12f104f
--- /dev/null
+++ b/configs/flamingo/flamingo_fewshot_vqa.py
@@ -0,0 +1,109 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Flamingo',
+ tokenizer=dict(
+ type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ lang_encoder=dict(
+ base=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='decapoda-research/llama-7b-hf',
+ local_files_only=True),
+ adapter=dict(
+ type='FlamingoLMAdapter',
+ vis_hidden_size=1024,
+ cross_attn_every_n_layers=4,
+ use_media_placement_augmentation=False),
+ ),
+ task='vqa',
+ shot_prompt_tmpl=
+ 'Question:{question} Short Answer:{answer}<|endofchunk|>',
+ final_prompt_tmpl='Question:{question} Short Answer:',
+ generation_cfg=dict(num_beams=3, max_new_tokens=5, length_penalty=-2.0))
+
+# data settings
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(
+ type='ApplyToList',
+ # Flamingo requires loading multiple images during few-shot inference.
+ scatter_key='img_path',
+ transforms=[
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=(224, 224)),
+ ],
+ collate_keys=['img', 'scale_factor', 'ori_shape'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'],
+ meta_keys=['image_id']),
+]
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOVQA',
+ data_root='data/coco',
+ data_prefix='val2014',
+ question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json',
+ ann_file='annotations/v2_mscoco_val2014_annotations.json',
+ pipeline=test_pipeline,
+ num_shots=2,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VQAAcc')
+
+test_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOVQA',
+ data_root='data/coco',
+ data_prefix='test2015',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json',
+ pipeline=test_pipeline,
+ num_shots=0,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json')
+
+# schedule settings
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/flamingo/flamingo_zeroshot_caption.py b/configs/flamingo/flamingo_zeroshot_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..deb786e4d56e70abd26723462068dfb9ad4ed9aa
--- /dev/null
+++ b/configs/flamingo/flamingo_zeroshot_caption.py
@@ -0,0 +1,95 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
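+# Text-only in-context examples prepended to the query at inference time; no
+# support images are used in the zero-shot setting (`num_shots=0` below).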
+zeroshot_prompt = (
+ 'Output:A child holding a flowered umbrella and petting a yak.<|endofchunk|>' # noqa: E501
+ 'Output:The child is holding a brush close to his mouth.<|endofchunk|>' # noqa: E501
+)
+
+# model settings
+model = dict(
+ type='Flamingo',
+ tokenizer=dict(
+ type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ lang_encoder=dict(
+ base=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='decapoda-research/llama-7b-hf',
+ local_files_only=True),
+ adapter=dict(
+ type='FlamingoLMAdapter',
+ vis_hidden_size=1024,
+ cross_attn_every_n_layers=4,
+ use_media_placement_augmentation=False),
+ ),
+ task='caption',
+ zeroshot_prompt=zeroshot_prompt,
+ final_prompt_tmpl='Output:',
+ generation_cfg=dict(num_beams=3, max_new_tokens=20, length_penalty=-2.0),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=(224, 224)),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['gt_caption'],
+ meta_keys=['image_id'],
+ ),
+]
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/captions_train2014.json',
+ data_prefix=dict(img_path='train2014'),
+ pipeline=test_pipeline,
+ num_shots=0,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/captions_train2014.json')
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
+
+# schedule settings
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/flamingo/flamingo_zeroshot_vqa.py b/configs/flamingo/flamingo_zeroshot_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..c43c7b8686679364490aa8acf893c61f4c5500f7
--- /dev/null
+++ b/configs/flamingo/flamingo_zeroshot_vqa.py
@@ -0,0 +1,107 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
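+# Text-only question/answer examples prepended as the zero-shot prompt; no
+# support images are used (`num_shots=0` in the dataloaders below).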
+zeroshot_prompt = (
+ 'Question:What is this photo taken looking through? Short Answer:pitcher<|endofchunk|>' # noqa: E501
+ 'Question:How many people are wearing shorts in the forefront of this photo? Short Answer:4<|endofchunk|>' # noqa: E501
+)
+
+# model settings
+model = dict(
+ type='Flamingo',
+ tokenizer=dict(
+ type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ lang_encoder=dict(
+ base=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='decapoda-research/llama-7b-hf',
+ local_files_only=True),
+ adapter=dict(
+ type='FlamingoLMAdapter',
+ vis_hidden_size=1024,
+ cross_attn_every_n_layers=4,
+ use_media_placement_augmentation=False),
+ ),
+ task='vqa',
+ zeroshot_prompt=zeroshot_prompt,
+ final_prompt_tmpl='Question:{question} Short Answer:',
+ generation_cfg=dict(num_beams=3, max_new_tokens=5, length_penalty=-2.0))
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=(224, 224)),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'],
+ meta_keys=['image_id'],
+ ),
+]
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOVQA',
+ data_root='data/coco',
+ data_prefix='val2014',
+ question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json',
+ ann_file='annotations/v2_mscoco_val2014_annotations.json',
+ pipeline=test_pipeline,
+ num_shots=0,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VQAAcc')
+
+test_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOVQA',
+ data_root='data/coco',
+ data_prefix='test2015',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json',
+ pipeline=test_pipeline,
+ num_shots=0,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json')
+
+# schedule settings
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/flamingo/metafile.yml b/configs/flamingo/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..6ff33e93b24ce1e10efb57c7465e9e6663709f97
--- /dev/null
+++ b/configs/flamingo/metafile.yml
@@ -0,0 +1,42 @@
+Collections:
+ - Name: Flamingo
+ Metadata:
+ Architecture:
+ - Transformer
+ - Gated Cross-Attention Dense
+ Paper:
+ Title: 'Flamingo: a Visual Language Model for Few-Shot Learning'
+ URL: https://arxiv.org/abs/2204.14198
+ README: configs/flamingo/README.md
+
+Models:
+ - Name: flamingo_3rdparty-zeroshot_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 8220452880
+ In Collection: Flamingo
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics:
+ CIDER: 65.50 # Report from the official repo
+ Weights: https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth
+ Config: configs/flamingo/flamingo_zeroshot_caption.py
+ Converted From:
+ Weights: https://huggingface.co/openflamingo/OpenFlamingo-9B
+ Code: https://github.com/mlfoundations/open_flamingo
+ - Name: flamingo_3rdparty-zeroshot_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 8220452880
+ In Collection: Flamingo
+ Results:
+ - Task: Visual Question Answering
+ Dataset: VQAv2
+ Metrics:
+ Accuracy: 43.50 # Report from the official repo
+ Weights: https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth
+ Config: configs/flamingo/flamingo_zeroshot_vqa.py
+ Converted From:
+ Weights: https://huggingface.co/openflamingo/OpenFlamingo-9B
+ Code: https://github.com/mlfoundations/open_flamingo
diff --git a/configs/glip/README.md b/configs/glip/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..48ee30560a92b8ce3c926f536f625b67cca957c2
--- /dev/null
+++ b/configs/glip/README.md
@@ -0,0 +1,57 @@
+# GLIP
+
+> [Grounded Language-Image Pre-training](https://arxiv.org/abs/2112.03857)
+
+
+
+## Abstract
+
+This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+model = get_model('swin-t_glip-pre_3rdparty', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+
+
+## Results and models
+
+### Pre-trained models
+
+The pre-trained models are only used for fine-tuning, and therefore don't have evaluation results.
+
+| Model | Pretrain | resolution | Download |
+| :------------------------------------------ | :------------------------: | :--------: | :-------------------------------------------------------------------------------------------------------------------: |
+| GLIP-T (`swin-t_glip-pre_3rdparty`)\* | O365,GoldG,CC3M,SBU | 224x224 | [model](https://download.openmmlab.com/mmclassification/v1/glip/swin-t_glip-pre_3rdparty_20230413-d85813b5.pth) |
+| GLIP-L (`swin-l_glip-pre_3rdparty_384px`)\* | FourODs,GoldG,CC3M+12M,SBU | 384x384 | [model](https://download.openmmlab.com/mmclassification/v1/glip/swin-l_glip-pre_3rdparty_384px_20230413-04b198e8.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/GLIP).*
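+
+To fine-tune from one of these checkpoints, the released weights can be loaded into the backbone of a new classifier through `init_cfg`. The config below is only a minimal sketch: the `_base_` files, head, and number of classes are illustrative choices rather than an official recipe, and it assumes the converted checkpoint stores the backbone weights under the `backbone.` prefix, as mmpretrain classifier checkpoints normally do.
+
+```python
+_base_ = [
+    '../_base_/datasets/imagenet_bs32_pil_resize.py',
+    '../_base_/schedules/imagenet_bs256_coslr.py',
+    '../_base_/default_runtime.py',
+]
+
+glip_t_ckpt = 'https://download.openmmlab.com/mmclassification/v1/glip/swin-t_glip-pre_3rdparty_20230413-d85813b5.pth'  # noqa: E501
+
+model = dict(
+    type='ImageClassifier',
+    backbone=dict(
+        type='SwinTransformer',
+        arch='tiny',
+        img_size=224,
+        # initialize the backbone from the GLIP-T pre-trained weights
+        init_cfg=dict(type='Pretrained', checkpoint=glip_t_ckpt, prefix='backbone')),
+    neck=dict(type='GlobalAveragePooling'),
+    head=dict(
+        type='LinearClsHead',
+        num_classes=1000,
+        in_channels=768,
+        loss=dict(type='CrossEntropyLoss', loss_weight=1.0)))
+```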
+
+## Citation
+
+```bibtex
+@inproceedings{li2021grounded,
+ title={Grounded Language-Image Pre-training},
+ author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
+ year={2022},
+ booktitle={CVPR},
+}
+```
diff --git a/configs/glip/glip-l_headless.py b/configs/glip/glip-l_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..991b6b85039bf0d24237a617dfeae285f97d7555
--- /dev/null
+++ b/configs/glip/glip-l_headless.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer',
+ arch='large',
+ img_size=384,
+ out_indices=(1, 2, 3), # original weight is for detection
+ stage_cfgs=dict(block_cfgs=dict(window_size=12))),
+ neck=None,
+ head=None)
+
+data_preprocessor = dict(
+ # BGR format normalization parameters
+ mean=[103.53, 116.28, 123.675],
+ std=[57.375, 57.12, 58.395],
+ # keep images in BGR order, do not convert to RGB
+ to_rgb=False,
+)
diff --git a/configs/glip/glip-t_headless.py b/configs/glip/glip-t_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..08b89f8f1e02a1d1fa230e437e6b6e3ac873821f
--- /dev/null
+++ b/configs/glip/glip-t_headless.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer',
+ arch='tiny',
+ img_size=224,
+ out_indices=(1, 2, 3), # original weight is for detection
+ ),
+ neck=None,
+ head=None)
+
+data_preprocessor = dict(
+ # BGR format normalization parameters
+ mean=[103.53, 116.28, 123.675],
+ std=[57.375, 57.12, 58.395],
+ # keep images in BGR order, do not convert to RGB
+ to_rgb=False,
+)
diff --git a/configs/glip/metafile.yml b/configs/glip/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..0691fd0d06c184082718be80d110a52dd9fae06b
--- /dev/null
+++ b/configs/glip/metafile.yml
@@ -0,0 +1,49 @@
+Collections:
+ - Name: GLIP
+ Metadata:
+ Training Techniques:
+ - AdamW
+ - Weight Decay
+ Architecture:
+ - Shift Window Multihead Self Attention
+ Paper:
+ URL: https://arxiv.org/abs/2112.03857
+ Title: "Grounded Language-Image Pre-training"
+ README: configs/glip/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/swin_transformer.py
+ Version: v1.0.0rc8
+
+Models:
+ - Name: swin-t_glip-pre_3rdparty
+ In Collection: GLIP
+ Metadata:
+ FLOPs: 4508464128
+ Parameters: 29056354
+ Training Data:
+ - O365
+ - GoldG
+ - CC3M
+ - SBU
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/glip/swin-t_glip-pre_3rdparty_20230413-d85813b5.pth
+ Converted From:
+ Weights: https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/glip_tiny_model_o365_goldg_cc_sbu.pth
+ Code: https://github.com/microsoft/GLIP
+ Config: configs/glip/glip-t_headless.py
+ - Name: swin-l_glip-pre_3rdparty_384px
+ In Collection: GLIP
+ Metadata:
+ FLOPs: 104080343040
+ Parameters: 196735516
+ Training Data:
+ - FourODs
+ - GoldG
+ - CC3M+12M
+ - SBU
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/glip/swin-l_glip-pre_3rdparty_384px_20230413-04b198e8.pth
+ Converted From:
+ Weights: https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/glip_large_model.pth
+ Code: https://github.com/microsoft/GLIP
+ Config: configs/glip/glip-l_headless.py
diff --git a/configs/hivit/README.md b/configs/hivit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..18ae0862c5db52a7f6f82451d398ee3e47d709ce
--- /dev/null
+++ b/configs/hivit/README.md
@@ -0,0 +1,81 @@
+# HiViT
+
+> [HiViT: A Simple and More Efficient Design of Hierarchical Vision Transformer](https://arxiv.org/abs/2205.14949)
+
+
+
+## Abstract
+
+Recently, masked image modeling (MIM) has offered a new methodology of self-supervised pre-training of vision transformers. A key idea of efficient implementation is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), albeit hierarchical vision transformers (e.g., Swin Transformer) have potentially better properties in formulating vision inputs. In this paper, we offer a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) that enjoys both high efficiency and good performance in MIM. The key is to remove the unnecessary "local inter-unit operations", deriving structurally simple hierarchical vision transformers in which mask-units can be serialized like plain vision transformers. For this purpose, we start with Swin Transformer and (i) set the masking unit size to be the token size in the main stage of Swin Transformer, (ii) switch off inter-unit self-attentions before the main stage, and (iii) eliminate all operations after the main stage. Empirical studies demonstrate the advantageous performance of HiViT in terms of fully-supervised, self-supervised, and transfer learning. In particular, in running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B, and the performance gain generalizes to downstream tasks of detection and segmentation. Code will be made publicly available.
+
+
+

+
+
+## How to use it?
+
+
+
+
+
+
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/hivit/hivit-tiny-p16_16xb64_in1k.py
+```
+
+
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :--------------------------------------: | :------: |
+| `hivit-tiny-p16_16xb64_in1k` | From scratch | 19.18 | 4.60 | 82.10 | [config](hivit-tiny-p16_16xb64_in1k.py) | N/A |
+| `hivit-small-p16_16xb64_in1k` | From scratch | 37.53 | 9.07 | N/A | [config](hivit-small-p16_16xb64_in1k.py) | N/A |
+| `hivit-base-p16_16xb64_in1k` | From scratch | 79.05 | 18.47 | N/A | [config](hivit-base-p16_16xb64_in1k.py) | N/A |
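+
+No pre-trained weights are released for these configs yet (the download links above are N/A), but the architectures can still be built and inspected directly from the config files. The snippet below is a minimal sketch, assuming `get_model` accepts a config file path and the command is run from the repository root.
+
+```python
+import torch
+
+from mmpretrain import get_model
+
+# Build HiViT-Tiny from its config; no checkpoint is loaded.
+model = get_model('configs/hivit/hivit-tiny-p16_16xb64_in1k.py')
+model.eval()
+
+inputs = torch.rand(1, 3, 224, 224)
+with torch.no_grad():
+    feats = model.extract_feat(inputs)
+print(type(feats))
+
+# Rough parameter count, to compare with the table above.
+print(sum(p.numel() for p in model.parameters()) / 1e6, 'M')
+```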
+
+## Citation
+
+```bibtex
+@inproceedings{zhanghivit,
+ title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer},
+ author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
+ booktitle={International Conference on Learning Representations},
+ year={2023},
+}
+```
diff --git a/configs/hivit/hivit-base-p16_16xb64_in1k.py b/configs/hivit/hivit-base-p16_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d37dcda86ba8db69cea47477f240e24564fcf91f
--- /dev/null
+++ b/configs/hivit/hivit-base-p16_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/hivit/base_224.py',
+ '../_base_/datasets/imagenet_bs64_hivit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_hivit.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/hivit/hivit-small-p16_16xb64_in1k.py b/configs/hivit/hivit-small-p16_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4fa3976672e839354c8a215ded9a02874ab78aca
--- /dev/null
+++ b/configs/hivit/hivit-small-p16_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/hivit/small_224.py',
+ '../_base_/datasets/imagenet_bs64_hivit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_hivit.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/hivit/hivit-tiny-p16_16xb64_in1k.py b/configs/hivit/hivit-tiny-p16_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ed3b6a7ae95a232995c50d26002fd6d5aa0fbe1
--- /dev/null
+++ b/configs/hivit/hivit-tiny-p16_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/hivit/tiny_224.py',
+ '../_base_/datasets/imagenet_bs64_hivit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_hivit.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/hivit/metafile.yml b/configs/hivit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..67f3a6961637a1a43f64063bdcdd567c163ab3df
--- /dev/null
+++ b/configs/hivit/metafile.yml
@@ -0,0 +1,63 @@
+Collections:
+ - Name: HiViT
+ Metadata:
+ Architecture:
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ Paper:
+ Title: 'HiViT: A Simple and More Efficient Design of Hierarchical Vision Transformer'
+ URL: https://arxiv.org/abs/2205.14949
+ README: configs/hivit/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: hivit-tiny-p16_16xb64_in1k
+ Metadata:
+ FLOPs: 4603000000
+ Parameters: 19181000
+ Training Data:
+ - ImageNet-1k
+ In Collection: HiViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.1
+ Task: Image Classification
+ Weights:
+ Config: configs/hivit/hivit-tiny-p16_16xb64_in1k.py
+
+ - Name: hivit-small-p16_16xb64_in1k
+ Metadata:
+ FLOPs: 9072000000
+ Parameters: 37526000
+ Training Data:
+ - ImageNet-1k
+ In Collection: HiViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy:
+ Task: Image Classification
+ Weights:
+ Config: configs/hivit/hivit-small-p16_16xb64_in1k.py
+
+ - Name: hivit-base-p16_16xb64_in1k
+ Metadata:
+ FLOPs: 18474000000
+ Parameters: 79051000
+ Training Data:
+ - ImageNet-1k
+ In Collection: HiViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy:
+ Task: Image Classification
+ Weights:
+ Config: configs/hivit/hivit-base-p16_16xb64_in1k.py
diff --git a/configs/hornet/README.md b/configs/hornet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b4dbf05bd35d4cfc0fc165ea857110e18ace664c
--- /dev/null
+++ b/configs/hornet/README.md
@@ -0,0 +1,80 @@
+# HorNet
+
+> [HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions](https://arxiv.org/abs/2207.14284)
+
+
+
+## Abstract
+
+Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution (gnConv) that performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable, which is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. gnConv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and a larger model size. Apart from the effectiveness in visual encoders, we also show gnConv can be applied to task-specific decoders and consistently improve dense prediction performance with less computation. Our results demonstrate that gnConv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('hornet-tiny_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('hornet-tiny_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/hornet/hornet-tiny_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny_3rdparty_in1k_20220915-0e8eedff.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-------------------------------------: | :-----------------------------------------------------------------------------: |
+| `hornet-tiny_3rdparty_in1k`\* | From scratch | 22.41 | 3.98 | 82.84 | 96.24 | [config](hornet-tiny_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny_3rdparty_in1k_20220915-0e8eedff.pth) |
+| `hornet-tiny-gf_3rdparty_in1k`\* | From scratch | 22.99 | 3.90 | 82.98 | 96.38 | [config](hornet-tiny-gf_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny-gf_3rdparty_in1k_20220915-4c35a66b.pth) |
+| `hornet-small_3rdparty_in1k`\* | From scratch | 49.53 | 8.83 | 83.79 | 96.75 | [config](hornet-small_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-small_3rdparty_in1k_20220915-5935f60f.pth) |
+| `hornet-small-gf_3rdparty_in1k`\* | From scratch | 50.40 | 8.71 | 83.98 | 96.77 | [config](hornet-small-gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-small-gf_3rdparty_in1k_20220915-649ca492.pth) |
+| `hornet-base_3rdparty_in1k`\* | From scratch | 87.26 | 15.58 | 84.24 | 96.94 | [config](hornet-base_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-base_3rdparty_in1k_20220915-a06176bb.pth) |
+| `hornet-base-gf_3rdparty_in1k`\* | From scratch | 88.42 | 15.42 | 84.32 | 96.95 | [config](hornet-base-gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-base-gf_3rdparty_in1k_20220915-82c06fa7.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/raoyongming/HorNet). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{rao2022hornet,
+ title={HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions},
+ author={Rao, Yongming and Zhao, Wenliang and Tang, Yansong and Zhou, Jie and Lim, Ser-Nam and Lu, Jiwen},
+ journal={arXiv preprint arXiv:2207.14284},
+ year={2022}
+}
+```
diff --git a/configs/hornet/hornet-base-gf_8xb64_in1k.py b/configs/hornet/hornet-base-gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b27012df51b4bc90303d5c30df83fb24a2d76690
--- /dev/null
+++ b/configs/hornet/hornet-base-gf_8xb64_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/hornet/hornet-base-gf.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+data = dict(samples_per_gpu=64)
+
+optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=1.0))
+
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/hornet/hornet-base_8xb64_in1k.py b/configs/hornet/hornet-base_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cb78a7ddaac26bcde4032c8342de251c3c26fb68
--- /dev/null
+++ b/configs/hornet/hornet-base_8xb64_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/hornet/hornet-base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+data = dict(samples_per_gpu=64)
+
+optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=5.0))
+
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/hornet/hornet-small-gf_8xb64_in1k.py b/configs/hornet/hornet-small-gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..96fcc77d8ca1f693f479f795e97469240f4632c3
--- /dev/null
+++ b/configs/hornet/hornet-small-gf_8xb64_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/hornet/hornet-small-gf.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+data = dict(samples_per_gpu=64)
+
+optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=1.0))
+
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/hornet/hornet-small_8xb64_in1k.py b/configs/hornet/hornet-small_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f0535ade00cdff0c4a25e6570a1316216f6fd37b
--- /dev/null
+++ b/configs/hornet/hornet-small_8xb64_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/hornet/hornet-small.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+data = dict(samples_per_gpu=64)
+
+optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=5.0))
+
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/hornet/hornet-tiny-gf_8xb128_in1k.py b/configs/hornet/hornet-tiny-gf_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3556de9c15ccb29b98fe1a7b68ee59cbbf320536
--- /dev/null
+++ b/configs/hornet/hornet-tiny-gf_8xb128_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/hornet/hornet-tiny-gf.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+data = dict(samples_per_gpu=128)
+
+optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=1.0))
+
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/hornet/hornet-tiny_8xb128_in1k.py b/configs/hornet/hornet-tiny_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..31bd1dd3fc9c4918c3043916fc155f9eb7faad1d
--- /dev/null
+++ b/configs/hornet/hornet-tiny_8xb128_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/hornet/hornet-tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+data = dict(samples_per_gpu=128)
+
+optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=100.0))
+
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/hornet/metafile.yml b/configs/hornet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..eba0ed2f4c9ac8eb758f5f5a81d023440ae53484
--- /dev/null
+++ b/configs/hornet/metafile.yml
@@ -0,0 +1,115 @@
+Collections:
+ - Name: HorNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ - Weight Decay
+ Architecture:
+ - HorNet
+ - gnConv
+ Paper:
+ URL: https://arxiv.org/abs/2207.14284
+ Title: "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions"
+ README: configs/hornet/README.md
+ Code:
+ Version: v0.24.0
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.24.0/mmcls/models/backbones/hornet.py
+
+Models:
+ - Name: hornet-tiny_3rdparty_in1k
+ Metadata:
+ FLOPs: 3976156352 # 3.98G
+ Parameters: 22409512 # 22.41M
+ In Collection: HorNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.84
+ Top 5 Accuracy: 96.24
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny_3rdparty_in1k_20220915-0e8eedff.pth
+ Config: configs/hornet/hornet-tiny_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/raoyongming/HorNet
+ Weights: https://cloud.tsinghua.edu.cn/f/1ca970586c6043709a3f/?dl=1
+ - Name: hornet-tiny-gf_3rdparty_in1k
+ Metadata:
+ FLOPs: 3896472160 # 3.9G
+ Parameters: 22991848 # 22.99M
+ In Collection: HorNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.98
+ Top 5 Accuracy: 96.38
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny-gf_3rdparty_in1k_20220915-4c35a66b.pth
+ Config: configs/hornet/hornet-tiny-gf_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/raoyongming/HorNet
+ Weights: https://cloud.tsinghua.edu.cn/f/511faad0bde94dfcaa54/?dl=1
+ - Name: hornet-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 8825621280 # 8.83G
+ Parameters: 49528264 # 49.53M
+ In Collection: HorNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.79
+ Top 5 Accuracy: 96.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-small_3rdparty_in1k_20220915-5935f60f.pth
+ Config: configs/hornet/hornet-small_8xb64_in1k.py
+ Converted From:
+ Code: https://github.com/raoyongming/HorNet
+ Weights: https://cloud.tsinghua.edu.cn/f/46422799db2941f7b684/?dl=1
+ - Name: hornet-small-gf_3rdparty_in1k
+ Metadata:
+ FLOPs: 8706094992 # 8.71G
+ Parameters: 50401768 # 50.4M
+ In Collection: HorNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.98
+ Top 5 Accuracy: 96.77
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-small-gf_3rdparty_in1k_20220915-649ca492.pth
+ Config: configs/hornet/hornet-small-gf_8xb64_in1k.py
+ Converted From:
+ Code: https://github.com/raoyongming/HorNet
+ Weights: https://cloud.tsinghua.edu.cn/f/8405c984bf084d2ba85a/?dl=1
+ - Name: hornet-base_3rdparty_in1k
+ Metadata:
+ FLOPs: 15582677376 # 15.59G
+ Parameters: 87256680 # 87.26M
+ In Collection: HorNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.24
+ Top 5 Accuracy: 96.94
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-base_3rdparty_in1k_20220915-a06176bb.pth
+ Config: configs/hornet/hornet-base_8xb64_in1k.py
+ Converted From:
+ Code: https://github.com/raoyongming/HorNet
+ Weights: https://cloud.tsinghua.edu.cn/f/5c86cb3d655d4c17a959/?dl=1
+ - Name: hornet-base-gf_3rdparty_in1k
+ Metadata:
+ FLOPs: 15423308992 # 15.42G
+ Parameters: 88421352 # 88.42M
+ In Collection: HorNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.32
+ Top 5 Accuracy: 96.95
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-base-gf_3rdparty_in1k_20220915-82c06fa7.pth
+ Config: configs/hornet/hornet-base-gf_8xb64_in1k.py
+ Converted From:
+ Code: https://github.com/raoyongming/HorNet
+ Weights: https://cloud.tsinghua.edu.cn/f/6c84935e63b547f383fb/?dl=1
diff --git a/configs/hrnet/README.md b/configs/hrnet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..31725cf8a4e062552fcc7a0be60562885944924c
--- /dev/null
+++ b/configs/hrnet/README.md
@@ -0,0 +1,85 @@
+# HRNet
+
+> [Deep High-Resolution Representation Learning for Visual Recognition](https://arxiv.org/abs/1908.07919v2)
+
+
+
+## Abstract
+
+High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions *in series* (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams *in parallel*; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('hrnet-w18_3rdparty_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('hrnet-w18_3rdparty_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/hrnet/hrnet-w18_4xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32_in1k_20220120-0c10b180.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-------------------------------: | :------------------------------------------------------------------------------: |
+| `hrnet-w18_3rdparty_8xb32_in1k`\* | From scratch | 21.30 | 4.33 | 76.75 | 93.44 | [config](hrnet-w18_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32_in1k_20220120-0c10b180.pth) |
+| `hrnet-w30_3rdparty_8xb32_in1k`\* | From scratch | 37.71 | 8.17 | 78.19 | 94.22 | [config](hrnet-w30_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w30_3rdparty_8xb32_in1k_20220120-8aa3832f.pth) |
+| `hrnet-w32_3rdparty_8xb32_in1k`\* | From scratch | 41.23 | 8.99 | 78.44 | 94.19 | [config](hrnet-w32_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w32_3rdparty_8xb32_in1k_20220120-c394f1ab.pth) |
+| `hrnet-w40_3rdparty_8xb32_in1k`\* | From scratch | 57.55 | 12.77 | 78.94 | 94.47 | [config](hrnet-w40_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w40_3rdparty_8xb32_in1k_20220120-9a2dbfc5.pth) |
+| `hrnet-w44_3rdparty_8xb32_in1k`\* | From scratch | 67.06 | 14.96 | 78.88 | 94.37 | [config](hrnet-w44_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w44_3rdparty_8xb32_in1k_20220120-35d07f73.pth) |
+| `hrnet-w48_3rdparty_8xb32_in1k`\* | From scratch | 77.47 | 17.36 | 79.32 | 94.52 | [config](hrnet-w48_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w48_3rdparty_8xb32_in1k_20220120-e555ef50.pth) |
+| `hrnet-w64_3rdparty_8xb32_in1k`\* | From scratch | 128.06 | 29.00 | 79.46 | 94.65 | [config](hrnet-w64_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w64_3rdparty_8xb32_in1k_20220120-19126642.pth) |
+| `hrnet-w18_3rdparty_8xb32-ssld_in1k`\* | From scratch | 21.30 | 4.33 | 81.06 | 95.70 | [config](hrnet-w18_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32-ssld_in1k_20220120-455f69ea.pth) |
+| `hrnet-w48_3rdparty_8xb32-ssld_in1k`\* | From scratch | 77.47 | 17.36 | 83.63 | 96.79 | [config](hrnet-w48_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w48_3rdparty_8xb32-ssld_in1k_20220120-d0459c38.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/HRNet/HRNet-Image-Classification). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{WangSCJDZLMTWLX19,
+ title={Deep High-Resolution Representation Learning for Visual Recognition},
+ author={Jingdong Wang and Ke Sun and Tianheng Cheng and
+ Borui Jiang and Chaorui Deng and Yang Zhao and Dong Liu and Yadong Mu and
+ Mingkui Tan and Xinggang Wang and Wenyu Liu and Bin Xiao},
+ journal={TPAMI},
+ year={2019}
+}
+```
diff --git a/configs/hrnet/hrnet-w18_4xb32_in1k.py b/configs/hrnet/hrnet-w18_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3bc329a7e050131b01305d0209cc087c8f2daa24
--- /dev/null
+++ b/configs/hrnet/hrnet-w18_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w18.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/hrnet-w30_4xb32_in1k.py b/configs/hrnet/hrnet-w30_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..669a66b8cc7af8b8b394dba3f915f184e3b9d28f
--- /dev/null
+++ b/configs/hrnet/hrnet-w30_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w30.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/hrnet-w32_4xb32_in1k.py b/configs/hrnet/hrnet-w32_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1e487403ffd242f4886962237a5bbfd57d6bbd62
--- /dev/null
+++ b/configs/hrnet/hrnet-w32_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w32.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/hrnet-w40_4xb32_in1k.py b/configs/hrnet/hrnet-w40_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1866a2a2b93d49164ebc8892342d11781a1ba9a5
--- /dev/null
+++ b/configs/hrnet/hrnet-w40_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w40.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/hrnet-w44_4xb32_in1k.py b/configs/hrnet/hrnet-w44_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ec913f7188151ea913f7ba324dc31845b1e9c11
--- /dev/null
+++ b/configs/hrnet/hrnet-w44_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w44.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/hrnet-w48_4xb32_in1k.py b/configs/hrnet/hrnet-w48_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0fc3f18ff03fafba4ff24d510546b6b0434c76c4
--- /dev/null
+++ b/configs/hrnet/hrnet-w48_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w48.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/hrnet-w64_4xb32_in1k.py b/configs/hrnet/hrnet-w64_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..659b3cd23ef16d953dc181d83016f955cd1570e0
--- /dev/null
+++ b/configs/hrnet/hrnet-w64_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w64.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/metafile.yml b/configs/hrnet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..3a17b1251333c17b3b1c7834b46d15b4c43b8bd3
--- /dev/null
+++ b/configs/hrnet/metafile.yml
@@ -0,0 +1,162 @@
+Collections:
+ - Name: HRNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Batch Normalization
+ - Convolution
+ - ReLU
+ - Residual Connection
+ Paper:
+ URL: https://arxiv.org/abs/1908.07919v2
+ Title: "Deep High-Resolution Representation Learning for Visual Recognition"
+ README: configs/hrnet/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/hrnet.py
+ Version: v0.20.1
+
+Models:
+ - Name: hrnet-w18_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 4330397932
+ Parameters: 21295164
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.75
+ Top 5 Accuracy: 93.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32_in1k_20220120-0c10b180.pth
+ Config: configs/hrnet/hrnet-w18_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33cMkPimlmClRvmpw
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w30_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 8168305684
+ Parameters: 37708380
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.19
+ Top 5 Accuracy: 94.22
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w30_3rdparty_8xb32_in1k_20220120-8aa3832f.pth
+ Config: configs/hrnet/hrnet-w30_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33cQoACCEfrzcSaVI
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w32_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 8986267584
+ Parameters: 41228840
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.44
+ Top 5 Accuracy: 94.19
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w32_3rdparty_8xb32_in1k_20220120-c394f1ab.pth
+ Config: configs/hrnet/hrnet-w32_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33dYBMemi9xOUFR0w
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w40_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 12767574064
+ Parameters: 57553320
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.94
+ Top 5 Accuracy: 94.47
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w40_3rdparty_8xb32_in1k_20220120-9a2dbfc5.pth
+ Config: configs/hrnet/hrnet-w40_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33ck0gvo5jfoWBOPo
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w44_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 14963902632
+ Parameters: 67061144
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.88
+ Top 5 Accuracy: 94.37
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w44_3rdparty_8xb32_in1k_20220120-35d07f73.pth
+ Config: configs/hrnet/hrnet-w44_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33czZQ0woUb980gRs
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w48_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 17364014752
+ Parameters: 77466024
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.32
+ Top 5 Accuracy: 94.52
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w48_3rdparty_8xb32_in1k_20220120-e555ef50.pth
+ Config: configs/hrnet/hrnet-w48_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33dKvqI6pBZlifgJk
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w64_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 29002298752
+ Parameters: 128056104
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.46
+ Top 5 Accuracy: 94.65
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w64_3rdparty_8xb32_in1k_20220120-19126642.pth
+ Config: configs/hrnet/hrnet-w64_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33gQbJsUPTIj3rQu99
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w18_3rdparty_8xb32-ssld_in1k
+ Metadata:
+ FLOPs: 4330397932
+ Parameters: 21295164
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.06
+ Top 5 Accuracy: 95.7
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32-ssld_in1k_20220120-455f69ea.pth
+ Config: configs/hrnet/hrnet-w18_4xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/HRNet/HRNet-Image-Classification/releases/download/PretrainedWeights/HRNet_W18_C_ssld_pretrained.pth
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w48_3rdparty_8xb32-ssld_in1k
+ Metadata:
+ FLOPs: 17364014752
+ Parameters: 77466024
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.63
+ Top 5 Accuracy: 96.79
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w48_3rdparty_8xb32-ssld_in1k_20220120-d0459c38.pth
+ Config: configs/hrnet/hrnet-w48_4xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/HRNet/HRNet-Image-Classification/releases/download/PretrainedWeights/HRNet_W48_C_ssld_pretrained.pth
+ Code: https://github.com/HRNet/HRNet-Image-Classification
diff --git a/configs/inception_v3/README.md b/configs/inception_v3/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..24fde38118de66a642938d4d23f95ed5e5bfb412
--- /dev/null
+++ b/configs/inception_v3/README.md
@@ -0,0 +1,76 @@
+# Inception V3
+
+> [Rethinking the Inception Architecture for Computer Vision](http://arxiv.org/abs/1512.00567)
+
+
+
+## Abstract
+
+Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error on the validation set (3.6% error on the test set) and 17.3% top-1 error on the validation set.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('inception-v3_3rdparty_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('inception-v3_3rdparty_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/inception_v3/inception-v3_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/inception-v3/inception-v3_3rdparty_8xb32_in1k_20220615-dcd4d910.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :-----------------------------------------------------------------------------: |
+| `inception-v3_3rdparty_8xb32_in1k`\* | From scratch | 23.83 | 5.75 | 77.57 | 93.58 | [config](inception-v3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/inception-v3/inception-v3_3rdparty_8xb32_in1k_20220615-dcd4d910.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/pytorch/vision/blob/main/torchvision/models/inception.py#L28). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{szegedy2016rethinking,
+ title={Rethinking the inception architecture for computer vision},
+ author={Szegedy, Christian and Vanhoucke, Vincent and Ioffe, Sergey and Shlens, Jon and Wojna, Zbigniew},
+ booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
+ pages={2818--2826},
+ year={2016}
+}
+```
diff --git a/configs/inception_v3/inception-v3_8xb32_in1k.py b/configs/inception_v3/inception-v3_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac977f4edbeca55afc3de118162b95cf47f7c15e
--- /dev/null
+++ b/configs/inception_v3/inception-v3_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/inception_v3.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py',
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=299),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
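+# Inception-style evaluation: resize the short edge to 342 (about 299 / 0.875)
+# and then take a 299x299 center crop.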
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=342, edge='short'),
+ dict(type='CenterCrop', crop_size=299),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/inception_v3/metafile.yml b/configs/inception_v3/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..0b556deccf0d4ed4bc096d59338da061190ae62f
--- /dev/null
+++ b/configs/inception_v3/metafile.yml
@@ -0,0 +1,37 @@
+Collections:
+ - Name: Inception V3
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 100
+ Batch Size: 256
+ Architecture:
+ - Inception
+ Paper:
+ URL: http://arxiv.org/abs/1512.00567
+ Title: "Rethinking the Inception Architecture for Computer Vision"
+ README: configs/inception_v3/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc1/configs/inception_v3/metafile.yml
+ Version: v1.0.0rc1
+
+Models:
+ - Name: inception-v3_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 5745177632
+ Parameters: 23834568
+ In Collection: Inception V3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.57
+ Top 5 Accuracy: 93.58
+ Weights: https://download.openmmlab.com/mmclassification/v0/inception-v3/inception-v3_3rdparty_8xb32_in1k_20220615-dcd4d910.pth
+ Config: configs/inception_v3/inception-v3_8xb32_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/inception_v3_google-0cc3c7bd.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/inception.py#L28
diff --git a/configs/itpn/README.md b/configs/itpn/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..93200d0224b64158961f68f9c0fcea0e4fb1da59
--- /dev/null
+++ b/configs/itpn/README.md
@@ -0,0 +1,65 @@
+# iTPN
+
+> [Integrally Pre-Trained Transformer Pyramid Networks](https://arxiv.org/abs/2211.12735)
+
+
+
+## Abstract
+
+In this paper, we present an integral pre-training framework based on masked image modeling (MIM). We advocate for pre-training the backbone and neck jointly so that the transfer gap between MIM and downstream recognition tasks is minimal. We make two technical contributions. First, we unify the reconstruction and recognition necks by inserting a feature pyramid into the pre-training stage. Second, we complement masked image modeling (MIM) with masked feature modeling (MFM) that offers multi-stage supervision to the feature pyramid. The pre-trained models, termed integrally pre-trained transformer pyramid networks (iTPNs), serve as powerful foundation models for visual recognition. In particular, the base/large-level iTPN achieves an 86.2%/87.8% top-1 accuracy on ImageNet-1K, a 53.2%/55.6% box AP on COCO object detection with 1x training schedule using Mask-RCNN, and a 54.7%/57.7% mIoU on ADE20K semantic segmentation using UPerHead -- all these results set new records. Our work inspires the community to work on unifying upstream pre-training and downstream fine-tuning tasks. Code and the pre-trained models will be released at https://github.com/sunsmarterjie/iTPN.
+
+
+

+
+
+## How to use it?
+
+
+
+
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py
+```
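+
+The `8xb512` in the config name indicates the intended setup of 8 GPUs with 512 samples per GPU. For a multi-GPU run, the repository's distributed launcher can be used instead (a minimal sketch, assuming 8 GPUs on a single node):
+
+```shell
+bash tools/dist_train.sh configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py 8
+```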
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------------------------------ | :--------: | :-------: | :----------------------------------------------------------------: | :------: |
+| `itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k` | 233.00 | 18.47 | [config](itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py) | N/A |
+| `itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k` | 103.00 | 18.47 | [config](itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py) | N/A |
+| `itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k` | 314.00 | 63.98 | [config](itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py) | N/A |
+
+## Citation
+
+```bibtex
+@article{tian2022integrally,
+ title={Integrally Pre-Trained Transformer Pyramid Networks},
+ author={Tian, Yunjie and Xie, Lingxi and Wang, Zhaozhi and Wei, Longhui and Zhang, Xiaopeng and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
+ journal={arXiv preprint arXiv:2211.12735},
+ year={2022}
+}
+```
diff --git a/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..40f35d9486e7b532dfd4904d94d379167222b62f
--- /dev/null
+++ b/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,84 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_itpn.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='iTPN',
+ backbone=dict(
+ type='iTPNHiViT',
+ arch='base',
+ drop_path_rate=0.0,
+ rpe=True,
+ layer_scale_init_value=0.1,
+ reconstruction_type='clip'),
+ neck=dict(
+ type='iTPNPretrainDecoder',
+ patch_size=16,
+ in_chans=3,
+ embed_dim=512,
+ mlp_ratio=4.,
+ reconstruction_type='clip',
+ # transformer pyramid
+ fpn_dim=256,
+ fpn_depth=2,
+ num_outs=3,
+ ),
+ head=dict(
+ type='iTPNClipHead',
+ embed_dims=512,
+ num_embed=512,
+ loss=dict(type='CosineSimilarityLoss')),
+ target_generator=dict(
+ type='CLIPGenerator',
+ tokenizer_path= # noqa
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa
+ ),
+)
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ # betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs.
+ optimizer=dict(
+ type='AdamW', lr=1.5e-3, betas=(0.9, 0.98), weight_decay=0.05),
+ clip_grad=dict(max_norm=3.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ '.norm': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py b/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c624e7302924ea544ff2e347966956c4652e4f5
--- /dev/null
+++ b/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py
@@ -0,0 +1,84 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_itpn.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='iTPN',
+ backbone=dict(
+ type='iTPNHiViT',
+ arch='base',
+ drop_path_rate=0.1,
+ rpe=True,
+ layer_scale_init_value=0.1,
+ reconstruction_type='clip'),
+ neck=dict(
+ type='iTPNPretrainDecoder',
+ patch_size=16,
+ in_chans=3,
+ embed_dim=512,
+ mlp_ratio=4.,
+ reconstruction_type='clip',
+ # transformer pyramid
+ fpn_dim=256,
+ fpn_depth=2,
+ num_outs=3,
+ ),
+ head=dict(
+ type='iTPNClipHead',
+ embed_dims=512,
+ num_embed=512,
+ loss=dict(type='CrossEntropyLoss')),
+ target_generator=dict(
+ type='CLIPGenerator',
+ tokenizer_path= # noqa
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa
+ ),
+)
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ # betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 800/1600 epochs.
+ optimizer=dict(
+ type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05),
+ clip_grad=dict(max_norm=3.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ '.norm': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d324a448fae9edd36fdcfa48c65829fa24a1be51
--- /dev/null
+++ b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/itpn_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
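+# The base LR follows the MAE-style linear scaling rule:
+# 1.5e-4 per 256 samples, scaled to the reference batch size of 4096.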
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c489dda9321774829fd5bf6e56de65603e177c6a
--- /dev/null
+++ b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/itpn_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ebc5be011a816d23fb0d6ce801d43fd8f4019ae7
--- /dev/null
+++ b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/itpn_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..359191bc84599016e33b7228a136a06db832b9ea
--- /dev/null
+++ b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/itpn_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='iTPNHiViT', arch='large'),
+ neck=dict(type='iTPNPretrainDecoder', embed_dim=768))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ca4ba00b23789e1b31e57bb6d1078498a9375f7a
--- /dev/null
+++ b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/itpn_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='iTPNHiViT', arch='large'),
+ neck=dict(type='iTPNPretrainDecoder', embed_dim=768))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b1e298b0b97db3c4391dcda5adac4e01438fdfc9
--- /dev/null
+++ b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/itpn_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='iTPNHiViT', arch='large'),
+ neck=dict(type='iTPNPretrainDecoder', embed_dim=768))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/itpn/metafile.yml b/configs/itpn/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..b8f5844de10f3df4114ba9eb655ed5baf844cb0e
--- /dev/null
+++ b/configs/itpn/metafile.yml
@@ -0,0 +1,50 @@
+Collections:
+ - Name: iTPN
+ Metadata:
+ Architecture:
+ - Dense Connections
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ Paper:
+ Title: 'Integrally Pre-Trained Transformer Pyramid Networks'
+ URL: https://arxiv.org/abs/2211.12735
+ README: configs/itpn/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k
+ Metadata:
+ FLOPs: 18474000000
+ Parameters: 233000000
+ Training Data:
+ - ImageNet-1k
+ In Collection: iTPN
+ Results: null
+ Weights:
+ Config: configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py
+
+ - Name: itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k
+ Metadata:
+ FLOPs: 18474000000
+ Parameters: 103000000
+ Training Data:
+ - ImageNet-1k
+ In Collection: iTPN
+ Results: null
+ Weights:
+ Config: configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py
+
+ - Name: itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k
+ Metadata:
+ FLOPs: 63977000000
+ Parameters: 314000000
+ Training Data:
+ - ImageNet-1k
+ In Collection: iTPN
+ Results: null
+ Weights:
+ Config: configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py
diff --git a/configs/lenet/README.md b/configs/lenet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2cd68eac42ed7fa1d0167fe1f7b9ad917e5ce735
--- /dev/null
+++ b/configs/lenet/README.md
@@ -0,0 +1,28 @@
+# LeNet
+
+> [Backpropagation Applied to Handwritten Zip Code Recognition](https://ieeexplore.ieee.org/document/6795724)
+
+
+
+## Abstract
+
+The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
+
+
+

+
+
+## Citation
+
+```bibtex
+@ARTICLE{6795724,
+ author={Y. {LeCun} and B. {Boser} and J. S. {Denker} and D. {Henderson} and R. E. {Howard} and W. {Hubbard} and L. D. {Jackel}},
+ journal={Neural Computation},
+ title={Backpropagation Applied to Handwritten Zip Code Recognition},
+ year={1989},
+ volume={1},
+ number={4},
+ pages={541-551},
+ doi={10.1162/neco.1989.1.4.541}
+}
+```
diff --git a/configs/lenet/lenet5_mnist.py b/configs/lenet/lenet5_mnist.py
new file mode 100644
index 0000000000000000000000000000000000000000..0ae8192548626c0073228a827d6b6b6595730a5e
--- /dev/null
+++ b/configs/lenet/lenet5_mnist.py
@@ -0,0 +1,89 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='LeNet5', num_classes=10),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# dataset settings
+dataset_type = 'MNIST'
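+# mean/std are on the 0-255 pixel scale; MNIST is single-channel, so only one
+# value is needed for each.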
+data_preprocessor = dict(mean=[33.46], std=[78.87], num_classes=10)
+
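+# LeNet-5 expects 32x32 inputs, so the 28x28 MNIST images are resized first.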
+pipeline = [dict(type='Resize', scale=32), dict(type='PackInputs')]
+
+common_data_cfg = dict(
+ type=dataset_type, data_prefix='data/mnist', pipeline=pipeline)
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=2,
+ dataset=dict(**common_data_cfg, test_mode=False),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=128,
+ num_workers=2,
+ dataset=dict(**common_data_cfg, test_mode=True),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, ))
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001))
+
+param_scheduler = dict(
+ type='MultiStepLR', # learning policy, decay on several milestones.
+ by_epoch=True, # update based on epoch.
+ milestones=[15], # decay at the 15th epoch.
+ gamma=0.1, # decay to 0.1 times.
+)
+
+train_cfg = dict(by_epoch=True, max_epochs=5, val_interval=1) # train 5 epochs
+val_cfg = dict()
+test_cfg = dict()
+
+# runtime settings
+default_scope = 'mmpretrain'
+
+default_hooks = dict(
+ # record the time of every iteration.
+ timer=dict(type='IterTimerHook'),
+ # print log every 150 iterations.
+ logger=dict(type='LoggerHook', interval=150),
+ # enable the parameter scheduler.
+ param_scheduler=dict(type='ParamSchedulerHook'),
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1),
+ # set sampler seed in distributed environment.
+ sampler_seed=dict(type='DistSamplerSeedHook'),
+)
+
+env_cfg = dict(
+ # disable cudnn benchmark
+ cudnn_benchmark=False,
+ # set multi process parameters
+ mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
+ # set distributed parameters
+ dist_cfg=dict(backend='nccl'),
+)
+
+log_level = 'INFO'
+
+# load from which checkpoint
+load_from = None
+
+# whether to resume the training of the checkpoint
+resume_from = None
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (1 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/levit/README.md b/configs/levit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..234edb60618b3edd61cb01c0c172513011b1b042
--- /dev/null
+++ b/configs/levit/README.md
@@ -0,0 +1,81 @@
+# LeViT
+
+> [LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference](https://arxiv.org/abs/2104.01136)
+
+
+
+## Abstract
+
+We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('levit-128s_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('levit-128s_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/levit/levit-128s_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/levit/levit-128s_3rdparty_in1k_20230117-e9fbd209.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :--------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :--------------------------------------------------------------------------------------: |
+| `levit-128s_3rdparty_in1k`\* | From scratch | 7.39 | 0.31 | 76.51 | 92.90 | [config](levit-128s_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-128s_3rdparty_in1k_20230117-e9fbd209.pth) |
+| `levit-128_3rdparty_in1k`\* | From scratch | 8.83 | 0.41 | 78.58 | 93.95 | [config](levit-128_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-128_3rdparty_in1k_20230117-3be02a02.pth) |
+| `levit-192_3rdparty_in1k`\* | From scratch | 10.56 | 0.67 | 79.86 | 94.75 | [config](levit-192_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-192_3rdparty_in1k_20230117-8217a0f9.pth) |
+| `levit-256_3rdparty_in1k`\* | From scratch | 18.38 | 1.14 | 81.59 | 95.46 | [config](levit-256_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-256_3rdparty_in1k_20230117-5ae2ce7d.pth) |
+| `levit-384_3rdparty_in1k`\* | From scratch | 38.36 | 2.37 | 82.59 | 95.95 | [config](levit-384_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-384_3rdparty_in1k_20230117-f3539cce.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/LeViT). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@InProceedings{Graham_2021_ICCV,
+ author = {Graham, Benjamin and El-Nouby, Alaaeldin and Touvron, Hugo and Stock, Pierre and Joulin, Armand and Jegou, Herve and Douze, Matthijs},
+ title = {LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference},
+ booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
+ month = {October},
+ year = {2021},
+ pages = {12259-12269}
+}
+```
diff --git a/configs/levit/deploy/levit-128_8xb256_in1k.py b/configs/levit/deploy/levit-128_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ab58119395339cc11a4cb09caad1ea0cb6c7ae3b
--- /dev/null
+++ b/configs/levit/deploy/levit-128_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../levit-128_8xb256_in1k.py'
+
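+# `deploy=True` switches the backbone and head to their fused inference-time
+# form, so this config is intended for evaluating re-parameterized weights
+# rather than for training.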
+model = dict(backbone=dict(deploy=True), head=dict(deploy=True))
diff --git a/configs/levit/deploy/levit-128s_8xb256_in1k.py b/configs/levit/deploy/levit-128s_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..93ebc3724714b73b362bc12de1b9029040cbc4f6
--- /dev/null
+++ b/configs/levit/deploy/levit-128s_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../levit-128s_8xb256_in1k.py'
+
+model = dict(backbone=dict(deploy=True), head=dict(deploy=True))
diff --git a/configs/levit/deploy/levit-192_8xb256_in1k.py b/configs/levit/deploy/levit-192_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..34249fda74d97b4f1e591cd39722b9cbdd94d3d2
--- /dev/null
+++ b/configs/levit/deploy/levit-192_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../levit-192_8xb256_in1k.py'
+
+model = dict(backbone=dict(deploy=True), head=dict(deploy=True))
diff --git a/configs/levit/deploy/levit-256_8xb256_in1k.py b/configs/levit/deploy/levit-256_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..687f83506e30fcf36041729b70b30822b30cae81
--- /dev/null
+++ b/configs/levit/deploy/levit-256_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../levit-256_8xb256_in1k.py'
+
+model = dict(backbone=dict(deploy=True), head=dict(deploy=True))
diff --git a/configs/levit/deploy/levit-384_8xb256_in1k.py b/configs/levit/deploy/levit-384_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9a83d47a54507022389bfb34c50ae466c978586b
--- /dev/null
+++ b/configs/levit/deploy/levit-384_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../levit-384_8xb256_in1k.py'
+
+model = dict(backbone=dict(deploy=True), head=dict(deploy=True))
diff --git a/configs/levit/levit-128_8xb256_in1k.py b/configs/levit/levit-128_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cdec48e3ffbb317ae464be244bf8e05cf4c41165
--- /dev/null
+++ b/configs/levit/levit-128_8xb256_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/levit-256-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_adamw_levit.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(backbone=dict(arch='128'), head=dict(in_channels=384))
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/levit/levit-128s_8xb256_in1k.py b/configs/levit/levit-128s_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0564cac7e018ec4e311f5e970e9211260ada402c
--- /dev/null
+++ b/configs/levit/levit-128s_8xb256_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/levit-256-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_adamw_levit.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(backbone=dict(arch='128s'), head=dict(in_channels=384))
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/levit/levit-192_8xb256_in1k.py b/configs/levit/levit-192_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..dfbf70e0ad2f0a35e4acca090bd6d2cadd6932f0
--- /dev/null
+++ b/configs/levit/levit-192_8xb256_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/levit-256-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_adamw_levit.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(backbone=dict(arch='192'), head=dict(in_channels=384))
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/levit/levit-256_8xb256_in1k.py b/configs/levit/levit-256_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e961e776faf923f7acceef8b2578f86e7f630afa
--- /dev/null
+++ b/configs/levit/levit-256_8xb256_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/levit-256-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_adamw_levit.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/levit/levit-384_8xb256_in1k.py b/configs/levit/levit-384_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..10ceac45c4cc907165d75c6b1b320c07f9a384e9
--- /dev/null
+++ b/configs/levit/levit-384_8xb256_in1k.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/models/levit-256-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_adamw_levit.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(arch='384', drop_path_rate=0.1),
+ head=dict(in_channels=768),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/levit/metafile.yml b/configs/levit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..78b62c5c12dcee63d1790b597f0222d7f8324361
--- /dev/null
+++ b/configs/levit/metafile.yml
@@ -0,0 +1,101 @@
+Collections:
+ - Name: LeViT
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - 1x1 Convolution
+ - LeViT Attention Block
+ Paper:
+ Title: "LeViT: a Vision Transformer in ConvNet\u2019s Clothing for Faster Inference"
+ URL: https://arxiv.org/abs/2104.01136
+ README: configs/levit/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/levit.py
+ Version: v1.0.0rc5
+
+Models:
+ - Name: levit-128s_3rdparty_in1k
+ Metadata:
+ FLOPs: 310342496
+ Parameters: 7391290
+ Training Data: ImageNet-1k
+ In Collection: LeViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.51
+ Top 5 Accuracy: 92.90
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-128s_3rdparty_in1k_20230117-e9fbd209.pth
+ Config: configs/levit/levit-128s_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-128S-96703c44.pth
+ Code: https://github.com/facebookresearch/LeViT
+ - Name: levit-128_3rdparty_in1k
+ Metadata:
+ FLOPs: 413060992
+ Parameters: 8828168
+ Training Data: ImageNet-1k
+ In Collection: LeViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.58
+ Top 5 Accuracy: 93.95
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-128_3rdparty_in1k_20230117-3be02a02.pth
+ Config: configs/levit/levit-128_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-128-b88c2750.pth
+ Code: https://github.com/facebookresearch/LeViT
+ - Name: levit-192_3rdparty_in1k
+ Metadata:
+ FLOPs: 667860704
+ Parameters: 10561301
+ Training Data: ImageNet-1k
+ In Collection: LeViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.86
+ Top 5 Accuracy: 94.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-192_3rdparty_in1k_20230117-8217a0f9.pth
+ Config: configs/levit/levit-192_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-192-92712e41.pth
+ Code: https://github.com/facebookresearch/LeViT
+ - Name: levit-256_3rdparty_in1k
+ Metadata:
+ FLOPs: 1141625216
+ Parameters: 18379852
+ Training Data: ImageNet-1k
+ In Collection: LeViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.59
+ Top 5 Accuracy: 95.46
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-256_3rdparty_in1k_20230117-5ae2ce7d.pth
+ Config: configs/levit/levit-256_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-256-13b5763e.pth
+ Code: https://github.com/facebookresearch/LeViT
+ - Name: levit-384_3rdparty_in1k
+ Metadata:
+ FLOPs: 2372941568
+ Parameters: 38358300
+ Training Data: ImageNet-1k
+ In Collection: LeViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.59
+ Top 5 Accuracy: 95.95
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-384_3rdparty_in1k_20230117-f3539cce.pth
+ Config: configs/levit/levit-384_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-384-9bdaf2e2.pth
+ Code: https://github.com/facebookresearch/LeViT
diff --git a/configs/llava/README.md b/configs/llava/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..581abfe5a66c30ce9ff1062d2fe605e17bb2f501
--- /dev/null
+++ b/configs/llava/README.md
@@ -0,0 +1,51 @@
+# LLaVA
+
+> [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485)
+
+
+
+## Abstract
+
+Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model, inference_model
+
+out = inference_model('llava-7b-v1_caption', 'demo/cat-dog.png', device='cuda')
+print(out)
+# {'pred_caption': 'In the image, there are two cats sitting on a blanket.'}
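+
+# `get_model` can also build the model explicitly, e.g. for custom
+# inference loops (same model name as in the table below):
+model = get_model('llava-7b-v1_caption', pretrained=True, device='cuda')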
+```
+
+
+
+## Models and results
+
+### Image Caption on COCO
+
+| Model | Params (M) | Config | Download |
+| :---------------------- | :--------: | :--------------------------------: | :-------------------------------------------------------------------------------------------------------------: |
+| `llava-7b-v1_caption` | 7045.82 | [config](llava-7b-v1_caption.py) | [ckpt](https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1_liuhaotian_20231025-c9e119b6.pth) |
+| `llava-7b-v1.5_caption` | 7062.90 | [config](llava-7b-v1.5_caption.py) | [ckpt](https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1.5_liuhaotian_20231025-5828aa5a.pth) |
+| `llava-7b-v1.5_vqa` | 7062.90 | [config](llava-7b-v1.5_vqa.py) | [ckpt](https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1.5_liuhaotian_20231025-5828aa5a.pth) |
+
+## Citation
+
+```bibtex
+@misc{liu2023llava,
+ title={Visual Instruction Tuning},
+ author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
+ publisher={arXiv:2304.08485},
+ year={2023},
+}
+```
diff --git a/configs/llava/llava-7b-v1.5_caption.py b/configs/llava/llava-7b-v1.5_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..371c9b5f6174416ade8708b9c74bc7f684f2af8c
--- /dev/null
+++ b/configs/llava/llava-7b-v1.5_caption.py
@@ -0,0 +1,76 @@
+_base_ = '../_base_/default_runtime.py'
+
+meta_prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions." # noqa: E501
+image_size = 336
+prompt_tmpl = f'''{meta_prompt} User:
+Describe the image in detail. ASSISTANT:'''
+
+# model settings
+model = dict(
+ type='Llava',
+ tokenizer=dict(
+ type='AutoTokenizer', name_or_path='liuhaotian/llava-v1.5-7b'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ img_size=image_size,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='mmpretrain.QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained='https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_336px_20231025-fb1315ed.pth',
+ ),
+ mm_hidden_size=1024,
+ use_im_patch=False,
+ use_im_start_end=False,
+ mm_proj_depth=2,
+ lang_encoder=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='huggyllama/llama-7b',
+ ),
+ task='caption',
+ prompt_tmpl=prompt_tmpl,
+ generation_cfg=dict(num_beams=3, max_new_tokens=50, length_penalty=-1.0),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(image_size, image_size),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+test_dataloader = dict(
+ batch_size=8,
+ num_workers=5,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_val.json',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+test_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/coco_karpathy_val_gt.json',
+)
+
+# schedule settings
+test_cfg = dict()
diff --git a/configs/llava/llava-7b-v1.5_vqa.py b/configs/llava/llava-7b-v1.5_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..5cb9812cd98b207c96b44da8261f4a11b4f04691
--- /dev/null
+++ b/configs/llava/llava-7b-v1.5_vqa.py
@@ -0,0 +1,76 @@
+_base_ = '../_base_/default_runtime.py'
+
+meta_prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions." # noqa: E501
+image_size = 336
+prompt_tmpl = f'''{meta_prompt} User:
+{{question}} ASSISTANT:'''
+
+# model settings
+model = dict(
+ type='Llava',
+ tokenizer=dict(
+ type='AutoTokenizer', name_or_path='liuhaotian/llava-v1.5-7b'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ img_size=image_size,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='mmpretrain.QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained='https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_336px_20231025-fb1315ed.pth',
+ ),
+ mm_hidden_size=1024,
+ use_im_patch=False,
+ use_im_start_end=False,
+ mm_proj_depth=2,
+ lang_encoder=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='huggyllama/llama-7b',
+ ),
+ task='vqa',
+ prompt_tmpl=prompt_tmpl,
+ generation_cfg=dict(max_new_tokens=100),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(image_size, image_size),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id', 'question']),
+]
+
+test_dataloader = dict(
+ batch_size=8,
+ num_workers=5,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_val.json',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+test_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/coco_karpathy_val_gt.json',
+)
+
+# schedule settings
+test_cfg = dict()
diff --git a/configs/llava/llava-7b-v1_caption.py b/configs/llava/llava-7b-v1_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..92e2d1fb65aab218a2c285c8d97b9f8886681304
--- /dev/null
+++ b/configs/llava/llava-7b-v1_caption.py
@@ -0,0 +1,78 @@
+_base_ = '../_base_/default_runtime.py'
+
+meta_prompt = 'You are LLaVA, a large language and vision assistant trained by UW Madison WAIV Lab. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. Follow the instructions carefully and explain your answers in detail.' # noqa: E501
+image_size = 224
+prompt_tmpl = f'''{meta_prompt} User:
+Describe the image in detail. ASSISTANT:'''
+
+# model settings
+model = dict(
+ type='Llava',
+ tokenizer=dict(
+ type='AutoTokenizer',
+ name_or_path='liuhaotian/LLaVA-Lightning-7B-delta-v1-1'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ img_size=image_size,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='mmpretrain.QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ mm_hidden_size=1024,
+ use_im_patch=False,
+ use_im_start_end=True,
+ mm_proj_depth=1,
+ lang_encoder=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='huggyllama/llama-7b',
+ ),
+ task='caption',
+ prompt_tmpl=prompt_tmpl,
+ generation_cfg=dict(max_new_tokens=50),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(image_size, image_size),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+test_dataloader = dict(
+ batch_size=8,
+ num_workers=5,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_val.json',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+test_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/coco_karpathy_val_gt.json',
+)
+
+# schedule settings
+test_cfg = dict()
diff --git a/configs/llava/metafile.yml b/configs/llava/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..406a214c33a5d8a3d1e2b73cfebd51975a27071e
--- /dev/null
+++ b/configs/llava/metafile.yml
@@ -0,0 +1,51 @@
+Collections:
+ - Name: LLaVA
+ Metadata:
+ Architecture:
+ - LLaMA
+ - CLIP
+ Paper:
+ Title: Visual Instruction Tuning
+ URL: https://arxiv.org/abs/2304.08485
+ README: configs/llava/README.md
+
+Models:
+ - Name: llava-7b-v1_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 7045816320
+ In Collection: LLaVA
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics:
+ BLEU-4: null
+ CIDER: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1_liuhaotian_20231025-c9e119b6.pth
+ Config: configs/llava/llava-7b-v1_caption.py
+ - Name: llava-7b-v1.5_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 7062900736
+ In Collection: LLaVA
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics:
+ BLEU-4: null
+ CIDER: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1.5_liuhaotian_20231025-5828aa5a.pth
+ Config: configs/llava/llava-7b-v1.5_caption.py
+ - Name: llava-7b-v1.5_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 7062900736
+ In Collection: LLaVA
+ Results:
+ - Task: Visual Question Answering
+ Dataset: COCO
+ Metrics:
+ BLEU-4: null
+ CIDER: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1.5_liuhaotian_20231025-5828aa5a.pth
+ Config: configs/llava/llava-7b-v1.5_vqa.py
diff --git a/configs/mae/README.md b/configs/mae/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..69f5f9bf35f9aa4bbe3097c58256496445f864dd
--- /dev/null
+++ b/configs/mae/README.md
@@ -0,0 +1,123 @@
+# MAE
+
+> [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)
+
+
+
+## Abstract
+
+This paper shows that masked autoencoders (MAE) are
+scalable self-supervised learners for computer vision. Our
+MAE approach is simple: we mask random patches of the
+input image and reconstruct the missing pixels. It is based
+on two core designs. First, we develop an asymmetric
+encoder-decoder architecture, with an encoder that operates only on the
+visible subset of patches (without mask tokens), along with a lightweight
+decoder that reconstructs the original image from the latent representation
+and mask tokens. Second, we find that masking a high proportion
+of the input image, e.g., 75%, yields a nontrivial and
+meaningful self-supervisory task. Coupling these two designs enables us to
+train large models efficiently and effectively: we accelerate
+training (by 3× or more) and improve accuracy. Our scalable approach allows
+for learning high-capacity models that generalize well: e.g., a vanilla
+ViT-Huge model achieves the best accuracy (87.8%) among
+methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
+
+
+
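+The masking described above is simple to sketch. The snippet below is a minimal, self-contained illustration of the 75% random patch masking (it is not the MMPreTrain implementation; the helper name and tensor shapes are assumptions for illustration):
+
+```python
+import torch
+
+def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
+    """Keep a random 25% of patch tokens; the rest are left for the decoder to reconstruct."""
+    B, N, C = tokens.shape
+    num_keep = int(N * (1 - mask_ratio))
+    ids_shuffle = torch.argsort(torch.rand(B, N), dim=1)  # random permutation per sample
+    ids_keep = ids_shuffle[:, :num_keep]                  # indices of visible patches
+    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, C))
+    mask = torch.ones(B, N)
+    mask.scatter_(1, ids_keep, 0.)                        # 0 = visible, 1 = masked
+    return visible, mask
+
+patch_tokens = torch.rand(2, 196, 768)  # 14x14 patches of a 224px image, ViT-B width
+visible, mask = random_masking(patch_tokens)
+print(visible.shape)  # torch.Size([2, 49, 768]): the encoder only sees 25% of the patches
+```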
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-base-p16_mae-300e-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mae_vit-base-p16_8xb512-amp-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py None
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :---------------------------------------------- | :--------: | :-------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------: |
+| `mae_vit-base-p16_8xb512-amp-coslr-300e_in1k` | 111.91 | 17.58 | [config](mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.json) |
+| `mae_vit-base-p16_8xb512-amp-coslr-400e_in1k` | 111.91 | 17.58 | [config](mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.json) |
+| `mae_vit-base-p16_8xb512-amp-coslr-800e_in1k` | 111.91 | 17.58 | [config](mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.json) |
+| `mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k` | 111.91 | 17.58 | [config](mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.json) |
+| `mae_vit-large-p16_8xb512-amp-coslr-400e_in1k` | 329.54 | 61.60 | [config](mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.json) |
+| `mae_vit-large-p16_8xb512-amp-coslr-800e_in1k` | 329.54 | 61.60 | [config](mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.json) |
+| `mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k` | 329.54 | 61.60 | [config](mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.json) |
+| `mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k` | 657.07 | 167.40 | [config](mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `vit-base-p16_mae-300e-pre_8xb128-coslr-100e_in1k` | [MAE 300-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.pth) | 86.57 | 17.58 | 83.10 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | N/A |
+| `vit-base-p16_mae-400e-pre_8xb128-coslr-100e_in1k` | [MAE 400-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.pth) | 86.57 | 17.58 | 83.30 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | N/A |
+| `vit-base-p16_mae-800e-pre_8xb128-coslr-100e_in1k` | [MAE 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth) | 86.57 | 17.58 | 83.30 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | N/A |
+| `vit-base-p16_mae-1600e-pre_8xb128-coslr-100e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth) | 86.57 | 17.58 | 83.50 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.json) |
+| `vit-base-p16_mae-300e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 300-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.pth) | 86.57 | 17.58 | 60.80 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-base-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 400-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.pth) | 86.57 | 17.58 | 62.50 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-base-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth) | 86.57 | 17.58 | 65.10 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-base-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth) | 86.57 | 17.58 | 67.10 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-large-p16_mae-400e-pre_8xb128-coslr-50e_in1k` | [MAE 400-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.pth) | 304.32 | 61.60 | 85.20 | [config](benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py) | N/A |
+| `vit-large-p16_mae-800e-pre_8xb128-coslr-50e_in1k` | [MAE 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.pth) | 304.32 | 61.60 | 85.40 | [config](benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py) | N/A |
+| `vit-large-p16_mae-1600e-pre_8xb128-coslr-50e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.pth) | 304.32 | 61.60 | 85.70 | [config](benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py) | N/A |
+| `vit-large-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 400-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.pth) | 304.33 | 61.60 | 70.70 | [config](benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-large-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.pth) | 304.33 | 61.60 | 73.70 | [config](benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-large-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.pth) | 304.33 | 61.60 | 75.50 | [config](benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-huge-p14_mae-1600e-pre_8xb128-coslr-50e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.pth) | 632.04 | 167.40 | 86.90 | [config](benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k_20220916-0bfc9bfd.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k_20220916-0bfc9bfd.json) |
+| `vit-huge-p14_mae-1600e-pre_32xb8-coslr-50e_in1k-448px` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.pth) | 633.03 | 732.13 | 87.30 | [config](benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448_20220916-95b6a0ce.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448_20220916-95b6a0ce.json) |
+
+## Citation
+
+```bibtex
+@article{He2021MaskedAA,
+ title={Masked Autoencoders Are Scalable Vision Learners},
+ author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and
+ Piotr Dollár and Ross B. Girshick},
+ journal={arXiv},
+ year={2021}
+}
+```
diff --git a/configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py b/configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4cf9ca1134766cd3b0179b7581511cd94dedbbc2
--- /dev/null
+++ b/configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,114 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ final_norm=False,
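+ # NOTE: the checkpoint path is intentionally left empty; pass the MAE pre-trained
+ # weights at launch time, e.g. --cfg-options model.backbone.init_cfg.checkpoint=<path or URL>.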
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=2e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.65,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py b/configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b0545c99d002925886349c7979ab0722fbf8f37a
--- /dev/null
+++ b/configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
@@ -0,0 +1,64 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=2048, drop_last=True)
+val_dataloader = dict(drop_last=False)
+test_dataloader = dict(drop_last=False)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ frozen_stages=12,
+ out_type='cls_token',
+ final_norm=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=dict(type='ClsBatchNormNeck', input_features=768),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]))
+
+# optimizer
+optim_wrapper = dict(
+ _delete_=True,
+ type='AmpOptimWrapper',
+ optimizer=dict(type='LARS', lr=6.4, weight_decay=0.0, momentum=0.9))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=80,
+ by_epoch=True,
+ begin=10,
+ end=90,
+ eta_min=0.0,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=10))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mae/benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py b/configs/mae/benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py
new file mode 100644
index 0000000000000000000000000000000000000000..60046b48d49f2bcc74a672c7b615da3062ad829b
--- /dev/null
+++ b/configs/mae/benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py
@@ -0,0 +1,116 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=448,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=512,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=448),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='huge',
+ img_size=448,
+ patch_size=14,
+ drop_path_rate=0.3, # set to 0.3
+ out_type='avg_featmap',
+ final_norm=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+# learning rate and layer decay rate are set to 0.004 and 0.75 respectively
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.75,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=45,
+ by_epoch=True,
+ begin=5,
+ end=50,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=50)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mae/benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2a9ff51890be80c6070058b2dd3e837027864da5
--- /dev/null
+++ b/configs/mae/benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py
@@ -0,0 +1,115 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='huge',
+ img_size=224,
+ patch_size=14,
+ drop_path_rate=0.3, # set to 0.3
+ out_type='avg_featmap',
+ final_norm=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+# learning rate and layer decay rate are set to 0.004 and 0.75 respectively
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.75,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=45,
+ by_epoch=True,
+ begin=5,
+ end=50,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=50)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mae/benchmarks/vit-huge-p14_8xb128-ds-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-huge-p14_8xb128-ds-coslr-50e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..813f7c03f300e1579b2ca036995b1a78135f2293
--- /dev/null
+++ b/configs/mae/benchmarks/vit-huge-p14_8xb128-ds-coslr-50e_in1k.py
@@ -0,0 +1,31 @@
+_base_ = ['./vit-huge-p14_8xb128-coslr-50e_in1k.py']
+
+# optimizer wrapper
+optim_wrapper = dict(type='DeepSpeedOptimWrapper')
+
+# training strategy
+strategy = dict(
+ type='DeepSpeedStrategy',
+ fp16=dict(
+ enabled=True,
+ fp16_master_weights_and_grads=False,
+ loss_scale=0,
+ loss_scale_window=500,
+ hysteresis=2,
+ min_loss_scale=1,
+ initial_scale_power=15,
+ ),
+ inputs_to_half=['inputs'],
+ zero_optimization=dict(
+ stage=1,
+ allgather_partitions=True,
+ reduce_scatter=True,
+ allgather_bucket_size=50000000,
+ reduce_bucket_size=50000000,
+ overlap_comm=True,
+ contiguous_gradients=True,
+ cpu_offload=False,
+ ))
+
+# runner which supports strategies
+runner_type = 'FlexibleRunner'
diff --git a/configs/mae/benchmarks/vit-huge-p14_8xb128-fsdp-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-huge-p14_8xb128-fsdp-coslr-50e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5f8dfb760f3e0282a5efce7bd9322ca381a802c2
--- /dev/null
+++ b/configs/mae/benchmarks/vit-huge-p14_8xb128-fsdp-coslr-50e_in1k.py
@@ -0,0 +1,13 @@
+_base_ = ['./vit-huge-p14_8xb128-coslr-50e_in1k.py']
+
+strategy = dict(
+ type='FSDPStrategy',
+ model_wrapper=dict(
+ auto_wrap_policy=dict(
+ type='torch.distributed.fsdp.wrap.size_based_auto_wrap_policy',
+ min_num_params=1e7)))
+
+optim_wrapper = dict(type='AmpOptimWrapper')
+
+# runner which supports strategies
+runner_type = 'FlexibleRunner'
diff --git a/configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ae86b40b8a262bc9f33e523afd161fdb014971bd
--- /dev/null
+++ b/configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py
@@ -0,0 +1,115 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.2, # set to 0.2
+ out_type='avg_featmap',
+ final_norm=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+# learning rate and layer decay rate are set to 0.004 and 0.75 respectively
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.75,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=45,
+ by_epoch=True,
+ begin=5,
+ end=50,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=50)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mae/benchmarks/vit-large-p16_8xb128-ds-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-large-p16_8xb128-ds-coslr-50e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9aedb431c5521f725912983444523f25340eac2a
--- /dev/null
+++ b/configs/mae/benchmarks/vit-large-p16_8xb128-ds-coslr-50e_in1k.py
@@ -0,0 +1,31 @@
+_base_ = ['./vit-large-p16_8xb128-coslr-50e_in1k.py']
+
+# optimizer wrapper
+optim_wrapper = dict(type='DeepSpeedOptimWrapper')
+
+# training strategy
+strategy = dict(
+ type='DeepSpeedStrategy',
+ fp16=dict(
+ enabled=True,
+ fp16_master_weights_and_grads=False,
+ loss_scale=0,
+ loss_scale_window=500,
+ hysteresis=2,
+ min_loss_scale=1,
+ initial_scale_power=15,
+ ),
+ inputs_to_half=['inputs'],
+ zero_optimization=dict(
+ stage=1,
+ allgather_partitions=True,
+ reduce_scatter=True,
+ allgather_bucket_size=50000000,
+ reduce_bucket_size=50000000,
+ overlap_comm=True,
+ contiguous_gradients=True,
+ cpu_offload=False,
+ ))
+
+# runner which supports strategies
+runner_type = 'FlexibleRunner'
diff --git a/configs/mae/benchmarks/vit-large-p16_8xb128-fsdp-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-large-p16_8xb128-fsdp-coslr-50e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3a8a67401eb3bb7204521d6ff97603eebc7e00c9
--- /dev/null
+++ b/configs/mae/benchmarks/vit-large-p16_8xb128-fsdp-coslr-50e_in1k.py
@@ -0,0 +1,13 @@
+_base_ = ['./vit-large-p16_8xb128-coslr-50e_in1k.py']
+
+strategy = dict(
+ type='FSDPStrategy',
+ model_wrapper=dict(
+ auto_wrap_policy=dict(
+ type='torch.distributed.fsdp.wrap.size_based_auto_wrap_policy',
+ min_num_params=1e7)))
+
+optim_wrapper = dict(type='AmpOptimWrapper')
+
+# runner which supports strategies
+runner_type = 'FlexibleRunner'
diff --git a/configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py b/configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c89518141c148161b2dbf082aa7b0a2eb0843539
--- /dev/null
+++ b/configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py
@@ -0,0 +1,64 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=2048, drop_last=True)
+val_dataloader = dict(drop_last=False)
+test_dataloader = dict(drop_last=False)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=224,
+ patch_size=16,
+ frozen_stages=24,
+ out_type='cls_token',
+ final_norm=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=dict(type='ClsBatchNormNeck', input_features=1024),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]))
+
+# optimizer
+optim_wrapper = dict(
+ _delete_=True,
+ type='AmpOptimWrapper',
+ optimizer=dict(type='LARS', lr=6.4, weight_decay=0.0, momentum=0.9))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=80,
+ by_epoch=True,
+ begin=10,
+ end=90,
+ eta_min=0.0,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=10))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..76c0df22b7bc5ac52dd50ebdaf4b141efa20352f
--- /dev/null
+++ b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8107fccb5c5c18df90cda43cccf21cb7b86f5245
--- /dev/null
+++ b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c150e0412b2092ec7a137bd3e488cea00ef2fc7f
--- /dev/null
+++ b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d5e40db5478755f751f4dd1c989d0c5906ca1d7
--- /dev/null
+++ b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEHiViT', arch='large'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=768))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c6c47d08fdfa676dd30f628fa06c60595434f85
--- /dev/null
+++ b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEHiViT', arch='large'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=768))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ed7d207a135264f9a1c20863fbf80d493f6f678
--- /dev/null
+++ b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEHiViT', arch='large'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=768))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..bbad841818f0a96ab233b96820446c7b0d72de4a
--- /dev/null
+++ b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f11fb2fa98c55034a7fa3397ea337044e43f3358
--- /dev/null
+++ b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d8f0398356cc8c1302d9739d73b88bec0bab3b92
--- /dev/null
+++ b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..01e0fb423969642174ac38d19a57e0db5c6cfc61
--- /dev/null
+++ b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+        start_factor=1e-9,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5eb7a427eb0a7cfcf2da5cbc85aa1ca89d82d152
--- /dev/null
+++ b/configs/mae/mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,66 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEViT', arch='h', patch_size=14),
+ neck=dict(
+ type='MAEPretrainDecoder',
+ embed_dim=1280,
+ patch_size=14,
+ num_patches=256),
+ head=dict(patch_size=14))
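+# With patch_size=14 on 224x224 inputs the encoder sees 224/14 = 16 patches per
+# side, i.e. 16 x 16 = 256 tokens, hence num_patches=256 in the decoder above.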
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..683790c0c9a80c532e0865627f48e313b3fc6595
--- /dev/null
+++ b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEViT', arch='l'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=1024))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-300e_in1k.py b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..539207466d25617946b2dde38612587da2b6f30e
--- /dev/null
+++ b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-300e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEViT', arch='l'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=1024))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f050522a2209fea0feaa2a594e10900fca47f006
--- /dev/null
+++ b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEViT', arch='l'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=1024))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a4294db3275a405357c08b09c07f5672faa4adc
--- /dev/null
+++ b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEViT', arch='l'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=1024))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+        start_factor=1e-9,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/metafile.yml b/configs/mae/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..8192672305de26ee20d00e1a59ad3180322491ed
--- /dev/null
+++ b/configs/mae/metafile.yml
@@ -0,0 +1,367 @@
+Collections:
+ - Name: MAE
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ Training Resources: 8x A100-80G GPUs
+ Architecture:
+ - ViT
+ Paper:
+ Title: Masked Autoencoders Are Scalable Vision Learners
+ URL: https://arxiv.org/abs/2111.06377
+ README: configs/mae/README.md
+
+Models:
+ - Name: mae_vit-base-p16_8xb512-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 111907840
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.pth
+ Config: configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
+ Downstream:
+ - vit-base-p16_mae-300e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-base-p16_mae-300e-pre_8xb128-coslr-100e_in1k
+ - Name: mae_vit-base-p16_8xb512-amp-coslr-400e_in1k
+ Metadata:
+ Epochs: 400
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 111907840
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.pth
+ Config: configs/mae/mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py
+ Downstream:
+ - vit-base-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-base-p16_mae-400e-pre_8xb128-coslr-100e_in1k
+ - Name: mae_vit-base-p16_8xb512-amp-coslr-800e_in1k
+ Metadata:
+ Epochs: 800
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 111907840
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth
+ Config: configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
+ Downstream:
+ - vit-base-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-base-p16_mae-800e-pre_8xb128-coslr-100e_in1k
+ - Name: mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k
+ Metadata:
+ Epochs: 1600
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 111907840
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth
+ Config: configs/mae/mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py
+ Downstream:
+ - vit-base-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-base-p16_mae-1600e-pre_8xb128-coslr-100e_in1k
+ - Name: mae_vit-large-p16_8xb512-amp-coslr-400e_in1k
+ Metadata:
+ Epochs: 400
+ Batch Size: 4096
+ FLOPs: 61603111936
+ Parameters: 329541888
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.pth
+ Config: configs/mae/mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py
+ Downstream:
+ - vit-large-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-large-p16_mae-400e-pre_8xb128-coslr-50e_in1k
+ - Name: mae_vit-large-p16_8xb512-amp-coslr-800e_in1k
+ Metadata:
+ Epochs: 800
+ Batch Size: 4096
+ FLOPs: 61603111936
+ Parameters: 329541888
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.pth
+ Config: configs/mae/mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py
+ Downstream:
+ - vit-large-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-large-p16_mae-800e-pre_8xb128-coslr-50e_in1k
+ - Name: mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k
+ Metadata:
+ Epochs: 1600
+ Batch Size: 4096
+ FLOPs: 61603111936
+ Parameters: 329541888
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.pth
+ Config: configs/mae/mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py
+ Downstream:
+ - vit-large-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-large-p16_mae-1600e-pre_8xb128-coslr-50e_in1k
+  - Name: mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k
+ Metadata:
+ Epochs: 1600
+ Batch Size: 4096
+ FLOPs: 167400741120
+ Parameters: 657074508
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.pth
+ Config: configs/mae/mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py
+ Downstream:
+ - vit-huge-p14_mae-1600e-pre_8xb128-coslr-50e_in1k
+ - vit-huge-p14_mae-1600e-pre_32xb8-coslr-50e_in1k-448px
+ - Name: vit-base-p16_mae-300e-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.1
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_mae-400e-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.3
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_mae-800e-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.3
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_mae-1600e-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.5
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.pth
+ Config: configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_mae-300e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 17581972992
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 60.8
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-base-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 17581972992
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 62.5
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-base-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 17581972992
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 65.1
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-base-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 17581972992
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 67.1
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-large-p16_mae-400e-pre_8xb128-coslr-50e_in1k
+ Metadata:
+ Epochs: 50
+ Batch Size: 1024
+ FLOPs: 61602103296
+ Parameters: 304324584
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.2
+ Weights: null
+ Config: configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py
+ - Name: vit-large-p16_mae-800e-pre_8xb128-coslr-50e_in1k
+ Metadata:
+ Epochs: 50
+ Batch Size: 1024
+ FLOPs: 61602103296
+ Parameters: 304324584
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.4
+ Weights: null
+ Config: configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py
+ - Name: vit-large-p16_mae-1600e-pre_8xb128-coslr-50e_in1k
+ Metadata:
+ Epochs: 50
+ Batch Size: 1024
+ FLOPs: 61602103296
+ Parameters: 304324584
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.7
+ Weights: null
+ Config: configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py
+ - Name: vit-large-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 61603112960
+ Parameters: 304326632
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 70.7
+ Weights: null
+ Config: configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-large-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 61603112960
+ Parameters: 304326632
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 73.7
+ Weights: null
+ Config: configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-large-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 61603112960
+ Parameters: 304326632
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 75.5
+ Weights: null
+ Config: configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-huge-p14_mae-1600e-pre_8xb128-coslr-50e_in1k
+ Metadata:
+ Epochs: 50
+ Batch Size: 1024
+ FLOPs: 167399096320
+ Parameters: 632043240
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.9
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k_20220916-0bfc9bfd.pth
+ Config: configs/mae/benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py
+ - Name: vit-huge-p14_mae-1600e-pre_32xb8-coslr-50e_in1k-448px
+ Metadata:
+ Epochs: 50
+ Batch Size: 256
+ FLOPs: 732131983360
+ Parameters: 633026280
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.3
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448_20220916-95b6a0ce.pth
+ Config: configs/mae/benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py
diff --git a/configs/maskfeat/README.md b/configs/maskfeat/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..d25b32bb2d45990d91185de0cb34ee7e5dd9ecc5
--- /dev/null
+++ b/configs/maskfeat/README.md
@@ -0,0 +1,85 @@
+# MaskFeat
+
+> [Masked Feature Prediction for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2112.09133v1)
+
+
+
+## Abstract
+
+We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------------: | :--------------------------------------------------------------------: |
+| `maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k` | 85.88 | 17.58 | [config](maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k` | [MASKFEAT](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth) | 86.57 | 17.58 | 83.40 | [config](benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.json) |
+
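+The fine-tuning entry above uses the benchmark config, whose `init_cfg.checkpoint`
+field is left empty and is meant to point at the released MaskFeat pre-training
+weights. As a rough sketch (the `--cfg-options` override below is an assumption
+based on the standard MMEngine train script, not an officially documented command):
+
+```shell
+python tools/train.py configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py \
+    --cfg-options model.backbone.init_cfg.checkpoint=https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth
+```
+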
+## Citation
+
+```bibtex
+@InProceedings{wei2022masked,
+ author = {Wei, Chen and Fan, Haoqi and Xie, Saining and Wu, Chao-Yuan and Yuille, Alan and Feichtenhofer, Christoph},
+ title = {Masked Feature Prediction for Self-Supervised Visual Pre-Training},
+ booktitle = {CVPR},
+ year = {2022},
+}
+```
diff --git a/configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py b/configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a7620b46b4337adbff8aa97834d347c5da09e55
--- /dev/null
+++ b/configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py
@@ -0,0 +1,114 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs'),
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=256, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=256, dataset=dict(pipeline=test_pipeline))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ final_norm=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=2e-5, bias=2e-5)
+ ]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=8e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.65,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
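+# The LearningRateDecayOptimWrapperConstructor above applies layer-wise lr decay:
+# parameters near the head keep the base lr (8e-3) while each earlier transformer
+# block is scaled by a further factor of layer_decay_rate=0.65, a common recipe
+# for fine-tuning MIM-pretrained ViTs.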
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=80,
+ by_epoch=True,
+ begin=20,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0)
diff --git a/configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..465ff5c36465080be4ad50e6b1511b728c3318f1
--- /dev/null
+++ b/configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,103 @@
+_base_ = '../_base_/default_runtime.py'
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.5, 1.0),
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='BEiTMaskGenerator',
+ input_size=14,
+ num_masking_patches=78,
+ min_num_patches=15,
+ ),
+ dict(type='PackInputs')
+]
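+# The BEiTMaskGenerator above masks about 78 of the 14 x 14 = 196 patch
+# positions per image, i.e. roughly a 40% mask ratio.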
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
+
+# model settings
+model = dict(
+ type='MaskFeat',
+ backbone=dict(type='MaskFeatViT', arch='b', patch_size=16),
+ neck=dict(
+ type='LinearNeck',
+ in_channels=768,
+ out_channels=108,
+ norm_cfg=None,
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=0.02, bias=0)),
+ head=dict(
+ type='MIMHead',
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')),
+ target_generator=dict(
+ type='HOGGenerator', nbins=9, pool=8, gaussian_window=16))
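+# With nbins=9, pool=8 and 16x16 patches, each patch yields 2x2 cells x 3 colour
+# channels x 9 orientation bins = 108 HOG values, matching out_channels=108 in
+# the LinearNeck above.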
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW', lr=2e-4 * 8, betas=(0.9, 0.999), weight_decay=0.05),
+ clip_grad=dict(max_norm=0.02),
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0,
+ norm_decay_mult=0.0,
+ flat_decay_mult=0.0,
+ custom_keys={
+ # 'pos_embed': dict(decay_mult=0.),
+ # 'cls_token': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=30,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=270,
+ eta_min=1e-6,
+ by_epoch=True,
+ begin=30,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/maskfeat/metafile.yml b/configs/maskfeat/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..1e1e1b4ae263077d2f88bc40aa893a57e3bba14a
--- /dev/null
+++ b/configs/maskfeat/metafile.yml
@@ -0,0 +1,43 @@
+Collections:
+ - Name: MaskFeat
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ Training Resources: 8x A100-80G GPUs
+ Architecture:
+ - ViT
+ Paper:
+ Title: Masked Feature Prediction for Self-Supervised Visual Pre-Training
+ URL: https://arxiv.org/abs/2112.09133v1
+ README: configs/maskfeat/README.md
+
+Models:
+ - Name: maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 2048
+ FLOPs: 17581972224
+ Parameters: 85882692
+ Training Data: ImageNet-1k
+ In Collection: MaskFeat
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth
+ Config: configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py
+ Downstream:
+ - vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k
+ - Name: vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 2048
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MaskFeat
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.4
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.pth
+ Config: configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py
diff --git a/configs/mff/README.md b/configs/mff/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7001c74be203f5997275372e57e9de4952a8f9f3
--- /dev/null
+++ b/configs/mff/README.md
@@ -0,0 +1,60 @@
+# MFF
+
+> [Improving Pixel-based MIM by Reducing Wasted Modeling Capability](https://arxiv.org/abs/2308.00261)
+
+
+
+## Abstract
+
+There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
+
+
+
+
+
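+## How to use it?
+
+The snippets below are a minimal usage sketch following the same `mmpretrain`
+API as the other pre-training READMEs in this repository; the model names are
+the ones registered in `configs/mff/metafile.yml` and are assumed to be
+downloadable in the same way as the other checkpoints.
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+# 'vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k' is the fine-tuned
+# classifier registered in the MFF metafile.
+predict = inference_model('vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+# Load the MFF pre-trained model registered in the metafile.
+model = get_model('mff_vit-base-p16_8xb512-amp-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+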
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :-------------------------------------------- | :--------: | :-------: | :------------------------------------------------------: | :------------------------------------------------------------------------------: |
+| `mff_vit-base-p16_8xb512-amp-coslr-300e_in1k` | - | - | [config](mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.json) |
+| `mff_vit-base-p16_8xb512-amp-coslr-800e_in1k` | - | - | [config](mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k` | [MFF 300-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth) | 86.57 | 17.58 | 83.00 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.pth) / [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.json) |
+| `vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k` | [MFF 800-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth) | 86.57 | 17.58 | 83.70 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb128-coslr-100e/vit-base-p16_8xb128-coslr-100e_20230802-6780e47d.pth) / [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb128-coslr-100e/vit-base-p16_8xb128-coslr-100e_20230802-6780e47d.json) |
+| `vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k` | [MFF 300-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth) | 86.57 | 17.58 | 64.20 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb2048-linear-coslr-90e_in1k/vit-base-p16_8xb2048-linear-coslr-90e_in1k.json) |
+| `vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k` | [MFF 800-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth) | 86.57 | 17.58 | 68.30 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb2048-linear-coslr-90e/vit-base-p16_8xb2048-linear-coslr-90e_20230802-6b1f7bc8.pth) / [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb2048-linear-coslr-90e/vit-base-p16_8xb2048-linear-coslr-90e_20230802-6b1f7bc8.json) |
+
+## Citation
+
+```bibtex
+@article{MFF,
+ title={Improving Pixel-based MIM by Reducing Wasted Modeling Capability},
+  author={Yuan Liu and Songyang Zhang and Jiacheng Chen and Zhaohui Yu and Kai Chen and Dahua Lin},
+  journal={arXiv preprint arXiv:2308.00261},
+ year={2023}
+}
+```
diff --git a/configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py b/configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4cf9ca1134766cd3b0179b7581511cd94dedbbc2
--- /dev/null
+++ b/configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,114 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ final_norm=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=2e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.65,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py b/configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..dc5f23077a20dad906fb44cf074322b394ea021d
--- /dev/null
+++ b/configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
@@ -0,0 +1,74 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ToPIL', to_rgb=True),
+ dict(type='MAERandomResizedCrop', size=224, interpolation=3),
+ dict(type='torchvision/RandomHorizontalFlip', p=0.5),
+ dict(type='ToNumpy', to_bgr=True),
+ dict(type='PackInputs'),
+]
+
+# dataset settings
+train_dataloader = dict(
+ batch_size=2048, drop_last=True, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(drop_last=False)
+test_dataloader = dict(drop_last=False)
+
+# model settings
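+# Linear probing: the ViT backbone is fully frozen (frozen_stages=12); only the
+# batch-norm neck and the linear classification head are trained.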
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ frozen_stages=12,
+ out_type='cls_token',
+ final_norm=True,
+ init_cfg=dict(type='Pretrained', prefix='backbone.')),
+ neck=dict(type='ClsBatchNormNeck', input_features=768),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]))
+
+# optimizer
+optim_wrapper = dict(
+ _delete_=True,
+ type='AmpOptimWrapper',
+ optimizer=dict(type='LARS', lr=6.4, weight_decay=0.0, momentum=0.9))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=80,
+ by_epoch=True,
+ begin=10,
+ end=90,
+ eta_min=0.0,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=1),
+ logger=dict(type='LoggerHook', interval=10))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mff/metafile.yml b/configs/mff/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..f1da4cc4823e7a4b80bb150987ceccd40e91bedd
--- /dev/null
+++ b/configs/mff/metafile.yml
@@ -0,0 +1,103 @@
+Collections:
+ - Name: MFF
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ Training Resources: 8x A100-80G GPUs
+ Architecture:
+ - ViT
+ Paper:
+ Title: Improving Pixel-based MIM by Reducing Wasted Modeling Capability
+ URL: https://arxiv.org/pdf/2308.00261.pdf
+ README: configs/mff/README.md
+
+Models:
+ - Name: mff_vit-base-p16_8xb512-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+      Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 85882692
+ Training Data: ImageNet-1k
+    In Collection: MFF
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth
+ Config: configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
+ Downstream:
+ - vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k
+ - vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k
+ - Name: mff_vit-base-p16_8xb512-amp-coslr-800e_in1k
+ Metadata:
+ Epochs: 800
+      Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 85882692
+ Training Data: ImageNet-1k
+    In Collection: MFF
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth
+ Config: configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
+ Downstream:
+ - vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k
+ - vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k
+ - Name: vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+    In Collection: MFF
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.0
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.pth
+ Config: configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MFF
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.7
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb128-coslr-100e/vit-base-p16_8xb128-coslr-100e_20230802-6780e47d.pth
+ Config: configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MFF
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 64.2
+    Weights: null
+ Config: configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MFF
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 68.3
+    Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb2048-linear-coslr-90e/vit-base-p16_8xb2048-linear-coslr-90e_20230802-6b1f7bc8.pth
+ Config: configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
diff --git a/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py b/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f9fc5219e4d8d7384bfc0e24bc98c67a71964962
--- /dev/null
+++ b/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
@@ -0,0 +1,24 @@
+_base_ = '../mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py'
+
+randomness = dict(seed=2, diff_rank_seed=True)
+
+# dataset config
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ToPIL', to_rgb=True),
+ dict(type='torchvision/Resize', size=224),
+ dict(
+ type='torchvision/RandomCrop',
+ size=224,
+ padding=4,
+ padding_mode='reflect'),
+ dict(type='torchvision/RandomHorizontalFlip', p=0.5),
+ dict(type='ToNumpy', to_bgr=True),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+
+# model config
+model = dict(
+ type='MFF', backbone=dict(type='MFFViT', out_indices=[0, 2, 4, 6, 8, 11]))
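+# out_indices picks intermediate encoder layers in addition to the last one;
+# their features are fused for pixel reconstruction, following the multi-level
+# feature fusion idea of MFF (shallow features assist pixel-based MIM).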
diff --git a/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d8976b22dd94d4d5d0906542c495fc23833d8e02
--- /dev/null
+++ b/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,24 @@
+_base_ = '../mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py'
+
+randomness = dict(seed=2, diff_rank_seed=True)
+
+# dataset config
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ToPIL', to_rgb=True),
+ dict(type='torchvision/Resize', size=224),
+ dict(
+ type='torchvision/RandomCrop',
+ size=224,
+ padding=4,
+ padding_mode='reflect'),
+ dict(type='torchvision/RandomHorizontalFlip', p=0.5),
+ dict(type='ToNumpy', to_bgr=True),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+
+# model config
+model = dict(
+ type='MFF', backbone=dict(type='MFFViT', out_indices=[0, 2, 4, 6, 8, 11]))
diff --git a/configs/milan/README.md b/configs/milan/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e1fe2289c56d27bd2fb9c6655dce769e92b155c7
--- /dev/null
+++ b/configs/milan/README.md
@@ -0,0 +1,104 @@
+# MILAN
+
+> [MILAN: Masked Image Pretraining on Language Assisted Representation](https://arxiv.org/pdf/2208.06049)
+
+
+
+## Abstract
+
+Self-attention based transformer models have been dominating many computer
+vision tasks in the past few years. Their superb model qualities heavily depend
+on the excessively large labeled image datasets. In order to reduce the reliance
+on large labeled datasets, reconstruction based masked autoencoders are gaining
+popularity, which learn high quality transferable representations from unlabeled
+images. For the same purpose, recent weakly supervised image pretraining methods
+explore language supervision from text captions accompanying the images. In this
+work, we propose masked image pretraining on language assisted representation,
+dubbed as MILAN. Instead of predicting raw pixels or low level features, our
+pretraining objective is to reconstruct the image features with substantial semantic
+signals that are obtained using caption supervision. Moreover, to accommodate our
+reconstruction target, we propose a more efficient prompting decoder architecture
+and a semantic aware mask sampling mechanism, which further advance the
+transfer performance of the pretrained model. Experimental results demonstrate
+that MILAN delivers higher accuracy than the previous works. When the masked
+autoencoder is pretrained and finetuned on ImageNet-1K dataset with an input
+resolution of 224×224, MILAN achieves a top-1 accuracy of 85.4% on ViT-B/16, surpassing the previous state of the art by 1%. In the downstream semantic
+segmentation task, MILAN achieves 52.7 mIoU using ViT-B/16 backbone on
+ADE20K dataset, outperforming previous masked pretraining results by 4 points.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-base-p16_milan-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('milan_vit-base-p16_16xb256-amp-coslr-400e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :----------------------------------------------- | :--------: | :-------: | :---------------------------------------------------------: | :------------------------------------------------------------------------: |
+| `milan_vit-base-p16_16xb256-amp-coslr-400e_in1k` | 111.91 | 17.58 | [config](milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `vit-base-p16_milan-pre_8xb128-coslr-100e_in1k` | [MILAN](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) | 86.57 | 17.58 | 85.30 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.json) |
+| `vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k` | [MILAN](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) | 86.57 | 17.58 | 78.90 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.json) |
+
+## Citation
+
+```bibtex
+@article{Hou2022MILANMI,
+ title={MILAN: Masked Image Pretraining on Language Assisted Representation},
+ author={Zejiang Hou and Fei Sun and Yen-Kuang Chen and Yuan Xie and S. Y. Kung},
+ journal={ArXiv},
+ year={2022}
+}
+```
diff --git a/configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py b/configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e8a3f4983ac19208090ee63e9c9160b945b22ee6
--- /dev/null
+++ b/configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,114 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ final_norm=False,
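+ # NOTE: `checkpoint` is left empty below; set it to the pretrained MILAN
+ # weights linked in the README (or override it via --cfg-options) before
+ # fine-tuning.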
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=4e-4, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.65,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=100)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py b/configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0b7333ca475ad1d9607ddda898acb623e1bd7aa4
--- /dev/null
+++ b/configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
@@ -0,0 +1,70 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+train_dataloader = dict(batch_size=2048, drop_last=True)
+val_dataloader = dict(drop_last=False)
+test_dataloader = dict(drop_last=False)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ frozen_stages=12,
+ out_type='cls_token',
+ final_norm=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=dict(type='ClsBatchNormNeck', input_features=768),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]),
+ data_preprocessor=dict(
+ num_classes=1000,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True,
+ ))
+
+# optimizer
+optim_wrapper = dict(
+ _delete_=True,
+ type='AmpOptimWrapper',
+ optimizer=dict(type='LARS', lr=3.2, weight_decay=0.0, momentum=0.9),
+)
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=90,
+ by_epoch=True,
+ begin=10,
+ end=100,
+ eta_min=0.0,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=10))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/milan/metafile.yml b/configs/milan/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..a790815fa28d063f909dfc1855b2a33f67f59893
--- /dev/null
+++ b/configs/milan/metafile.yml
@@ -0,0 +1,59 @@
+Collections:
+ - Name: MILAN
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ Training Resources: 16x A100-80G GPUs
+ Architecture:
+ - ViT
+ Paper:
+ Title: 'MILAN: Masked Image Pretraining on Language Assisted Representation'
+ URL: https://arxiv.org/pdf/2208.06049
+ README: configs/milan/README.md
+
+Models:
+ - Name: milan_vit-base-p16_16xb256-amp-coslr-400e_in1k
+ Metadata:
+ Epochs: 400
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 111907584
+ Training Data: ImageNet-1k
+ In Collection: MILAN
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth
+ Config: configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py
+ Downstream:
+ - vit-base-p16_milan-pre_8xb128-coslr-100e_in1k
+ - vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k
+ - Name: vit-base-p16_milan-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MILAN
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.3
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth
+ Config: configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 16384
+ FLOPs: 17581972992
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MILAN
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.9
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.pth
+ Config: configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
diff --git a/configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py b/configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac80ab7b1bff159eed3eacc432a1b7b48e4cb221
--- /dev/null
+++ b/configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py
@@ -0,0 +1,88 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='MILAN',
+ backbone=dict(
+ type='MILANViT',
+ arch='b',
+ patch_size=16,
+ mask_ratio=0.75,
+ init_cfg=[
+ dict(type='Xavier', distribution='uniform', layer='Linear'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ neck=dict(
+ type='MILANPretrainDecoder',
+ init_cfg=[
+ dict(type='Xavier', distribution='uniform', layer='Linear'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ head=dict(
+ type='MIMHead',
+ loss=dict(
+ type='CosineSimilarityLoss', shift_factor=2.0, scale_factor=2.0),
+ ),
+ target_generator=dict(
+ type='CLIPGenerator',
+ tokenizer_path= # noqa
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa
+ ),
+ init_cfg=None)
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+find_unused_parameters = True
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/minigpt4/README.md b/configs/minigpt4/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..23666fc9f951262bf9aee65dda933c0000b891f8
--- /dev/null
+++ b/configs/minigpt4/README.md
@@ -0,0 +1,53 @@
+# MiniGPT4
+
+> [MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models](https://arxiv.org/abs/2304.10592)
+
+
+
+## Abstract
+
+The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4 like detailed image description generation and website creation from hand-written drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, teaching users how to cook based on food photos, etc. In our experiment, we found that only performing the pretraining on raw image-text pairs could produce unnatural language outputs that lack coherency including repetition and fragmented sentences. To address this problem, we curate a high-quality, well-aligned dataset in the second stage to finetune our model using a conversational template. This step proved crucial for augmenting the model's generation reliability and overall usability. Notably, our model is highly computationally efficient, as we only train a projection layer utilizing approximately 5 million aligned image-text pairs. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+from mmpretrain import inference_model
+
+result = inference_model('minigpt-4_vicuna-7b_caption', 'demo/cat-dog.png')
+print(result)
+# {'pred_caption': 'This image shows a small dog and a kitten sitting on a blanket in a field of flowers. The dog is looking up at the kitten with a playful expression on its face. The background is a colorful striped blanket, and there are flowers all around them. The image is well composed with the two animals sitting in the center of the frame, surrounded by the flowers and blanket.'}
+```
+
+
+
+## Models and results
+
+For the Vicuna model, please refer to the [MiniGPT-4 page](https://github.com/Vision-CAIR/MiniGPT-4) for preparation guidelines.
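+
+The sketch below shows one way to point this config at locally prepared weights (a minimal sketch, assuming the `mmengine` config API and a hypothetical local directory holding the Vicuna-7B weights; adjust the path to your setup):
+
+```python
+from mmengine.config import Config
+
+# Hypothetical path to the Vicuna-7B weights prepared per the MiniGPT-4 page.
+vicuna_dir = '/path/to/vicuna-7b'
+
+cfg = Config.fromfile('configs/minigpt4/minigpt-4_vicuna-7b_caption.py')
+cfg.model.lang_encoder.name_or_path = vicuna_dir
+cfg.model.tokenizer.name_or_path = vicuna_dir
+
+# Save a local copy that tools/test.py or the inference APIs can consume.
+cfg.dump('configs/minigpt4/minigpt-4_vicuna-7b_caption_local.py')
+```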
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------ | :--------: | :-------: | :----------------------------------------: | :----------------------------------------------------------------------------------------------------------: |
+| `minigpt-4_baichuan-7b_caption` | 8094.77 | N/A | [config](minigpt-4_baichuan-7b_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_baichuan7b_20231011-5dca7ed6.pth) |
+| `minigpt-4_vicuna-7b_caption`\* | 8121.32 | N/A | [config](minigpt-4_vicuna-7b_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_vicuna7b_20230615-714b5f52.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/Vision-CAIR/MiniGPT-4/tree/main). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{zhu2023minigpt,
+ title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models},
+ author={Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed},
+ journal={arXiv preprint arXiv:2304.10592},
+ year={2023}
+}
+```
diff --git a/configs/minigpt4/metafile.yml b/configs/minigpt4/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..f70cc9ba6045f237414f8dc3ee8572187528a667
--- /dev/null
+++ b/configs/minigpt4/metafile.yml
@@ -0,0 +1,37 @@
+Collections:
+ - Name: MiniGPT4
+ Metadata:
+ Architecture:
+ - Transformer
+ - Gated Cross-Attention Dense
+ Paper:
+ Title: 'MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models'
+ URL: https://arxiv.org/abs/2304.10592
+ README: configs/minigpt4/README.md
+
+Models:
+ - Name: minigpt-4_vicuna-7b_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 8121315072
+ In Collection: MiniGPT4
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_vicuna7b_20230615-714b5f52.pth
+ Config: configs/minigpt4/minigpt-4_vicuna-7b_caption.py
+ Converted From:
+ Weights: https://github.com/Vision-CAIR/MiniGPT-4/tree/main
+ Code: https://github.com/Vision-CAIR/MiniGPT-4/tree/main
+ - Name: minigpt-4_baichuan-7b_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 8094769024
+ In Collection: MiniGPT4
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_baichuan7b_20231011-5dca7ed6.pth
+ Config: configs/minigpt4/minigpt-4_baichuan-7b_caption.py
diff --git a/configs/minigpt4/minigpt-4_baichuan-7b_caption.py b/configs/minigpt4/minigpt-4_baichuan-7b_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..7e610a099c8dfcea86dff87c69487f6879926f21
--- /dev/null
+++ b/configs/minigpt4/minigpt-4_baichuan-7b_caption.py
@@ -0,0 +1,190 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(224, 224),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='CleanCaption',
+ keys='chat_content',
+ remove_chars='',
+ lowercase=False),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['chat_content', 'lang'],
+ meta_keys=['image_id']),
+]
+
+train_dataloader = dict(
+ batch_size=2,
+ num_workers=4,
+ dataset=dict(
+ type='MiniGPT4Dataset',
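+ # Placeholders: replace `data_root` and `ann_file` below with your own
+ # dataset directory and annotation file before launching training.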
+ data_root='YOUR_DATA_DIRECTORY',
+ ann_file='YOUR_DATA_FILE',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ drop_last=False,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(224, 224),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+test_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/coco_karpathy_val_gt.json',
+)
+
+test_dataloader = dict(
+ batch_size=1,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_val.json',
+ pipeline=test_pipeline))
+
+# model settings
+model = dict(
+ type='MiniGPT4',
+ vision_encoder=dict(
+ type='BEiTViT',
+ # eva-g without the final layer
+ arch=dict(
+ embed_dims=1408,
+ num_layers=39,
+ num_heads=16,
+ feedforward_channels=6144,
+ ),
+ img_size=224,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ frozen_stages=39,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ final_norm=False,
+ use_shared_rel_pos_bias=False,
+ out_type='raw',
+ pretrained= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_eva-g-p14_20230615-e908c021.pth' # noqa
+ ),
+ q_former_model=dict(
+ type='Qformer',
+ model_style='bert-base-uncased',
+ vision_model_width=1408,
+ add_cross_attention=True,
+ cross_attention_freq=2,
+ num_query_token=32,
+ pretrained= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_qformer_20230615-1dfa889c.pth' # noqa
+ ),
+ lang_encoder=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='baichuan-inc/baichuan-7B',
+ trust_remote_code=True),
+ tokenizer=dict(
+ type='AutoTokenizer',
+ name_or_path='baichuan-inc/baichuan-7B',
+ trust_remote_code=True),
+ task='caption',
+ prompt_template=dict([('en', '###Ask: {} ###Answer: '),
+ ('zh', '###问:{} ###答:')]),
+ # '<ImageHere>' marks where the image embedding is inserted into the prompt.
+ raw_prompts=dict([
+ ('en', [('<Img><ImageHere></Img> '
+ 'Describe this image in detail.'),
+ ('<Img><ImageHere></Img> '
+ 'Take a look at this image and describe what you notice.'),
+ ('<Img><ImageHere></Img> '
+ 'Please provide a detailed description of the picture.'),
+ ('<Img><ImageHere></Img> '
+ 'Could you describe the contents of this image for me?')]),
+ ('zh', [('<Img><ImageHere></Img> '
+ '详细描述这张图片。'),
+ ('<Img><ImageHere></Img> '
+ '浏览这张图片并描述你注意到什么。'),
+ ('<Img><ImageHere></Img> '
+ '请对这张图片进行详细的描述。'),
+ ('<Img><ImageHere></Img> '
+ '你能为我描述这张图片的内容吗?')])
+ ]),
+ max_txt_len=160,
+ end_sym='###')
+
+strategy = dict(
+ type='DeepSpeedStrategy',
+ fp16=dict(
+ enabled=True,
+ auto_cast=False,
+ fp16_master_weights_and_grads=False,
+ loss_scale=0,
+ loss_scale_window=1000,
+ hysteresis=1,
+ min_loss_scale=1,
+ initial_scale_power=16,
+ ),
+ inputs_to_half=[0],
+ zero_optimization=dict(
+ stage=2,
+ allgather_partitions=True,
+ allgather_bucket_size=2e8,
+ reduce_scatter=True,
+ reduce_bucket_size='auto',
+ overlap_comm=True,
+ contiguous_gradients=True,
+ ),
+)
+
+# schedule settings
+optim_wrapper = dict(
+ type='DeepSpeedOptimWrapper',
+ optimizer=dict(type='AdamW', lr=1e-3, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-3 / 500,
+ by_epoch=False,
+ begin=0,
+ end=500,
+ ),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=2e-4,
+ by_epoch=False,
+ begin=500,
+ ),
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=6)
+test_cfg = dict()
+
+runner_type = 'FlexibleRunner'
+
+default_hooks = dict(
+ checkpoint=dict(
+ type='CheckpointHook',
+ interval=1,
+ by_epoch=True,
+ save_last=True,
+ max_keep_ckpts=1,
+ ))
diff --git a/configs/minigpt4/minigpt-4_vicuna-7b_caption.py b/configs/minigpt4/minigpt-4_vicuna-7b_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..f468e2d8fac7ce46871801c9cc490acb97db683d
--- /dev/null
+++ b/configs/minigpt4/minigpt-4_vicuna-7b_caption.py
@@ -0,0 +1,94 @@
+_base_ = [
+ '../_base_/datasets/coco_caption.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(224, 224),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+val_dataloader = dict(batch_size=1, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='MiniGPT4',
+ vision_encoder=dict(
+ type='BEiTViT',
+ # eva-g without the final layer
+ arch=dict(
+ embed_dims=1408,
+ num_layers=39,
+ num_heads=16,
+ feedforward_channels=6144,
+ ),
+ img_size=224,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ frozen_stages=39,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ final_norm=False,
+ use_shared_rel_pos_bias=False,
+ out_type='raw',
+ pretrained= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_eva-g-p14_20230615-e908c021.pth' # noqa
+ ),
+ q_former_model=dict(
+ type='Qformer',
+ model_style='bert-base-uncased',
+ vision_model_width=1408,
+ add_cross_attention=True,
+ cross_attention_freq=2,
+ num_query_token=32,
+ pretrained= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_qformer_20230615-1dfa889c.pth' # noqa
+ ),
+ lang_encoder=dict(
+ type='AutoModelForCausalLM', name_or_path='YOUR_PATH_TO_VICUNA'),
+ tokenizer=dict(type='LlamaTokenizer', name_or_path='YOUR_PATH_TO_VICUNA'),
+ task='caption',
+ prompt_template=dict([('en', '###Ask: {} ###Answer: '),
+ ('zh', '###问:{} ###答:')]),
+ raw_prompts=dict([
+ ('en', [('<Img><ImageHere></Img> '
+ 'Describe this image in detail.'),
+ ('<Img><ImageHere></Img> '
+ 'Take a look at this image and describe what you notice.'),
+ ('<Img><ImageHere></Img> '
+ 'Please provide a detailed description of the picture.'),
+ ('<Img><ImageHere></Img> '
+ 'Could you describe the contents of this image for me?')]),
+ ('zh', [('<Img><ImageHere></Img> '
+ '详细描述这张图片。'),
+ ('<Img><ImageHere></Img> '
+ '浏览这张图片并描述你注意到什么。'),
+ ('<Img><ImageHere></Img> '
+ '请对这张图片进行详细的描述。'),
+ ('<Img><ImageHere></Img> '
+ '你能为我描述这张图片的内容吗?')])
+ ]),
+ max_txt_len=160,
+ end_sym='###')
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=5)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/mixmim/README.md b/configs/mixmim/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e07f5011b32463a7be65d2cbe285148e88a6b3fc
--- /dev/null
+++ b/configs/mixmim/README.md
@@ -0,0 +1,102 @@
+# MixMIM
+
+> [MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning](https://arxiv.org/abs/2205.13137)
+
+
+
+## Abstract
+
+In this study, we propose Mixed and Masked Image Modeling (MixMIM), a
+simple but efficient MIM method that is applicable to various hierarchical Vision
+Transformers. Existing MIM methods replace a random subset of input tokens with
+a special [MASK] symbol and aim at reconstructing original image tokens from
+the corrupted image. However, we find that using the [MASK] symbol greatly
+slows down the training and causes training-finetuning inconsistency, due to the
+large masking ratio (e.g., 40% in BEiT). In contrast, we replace the masked tokens
+of one image with visible tokens of another image, i.e., creating a mixed image.
+We then conduct dual reconstruction to reconstruct the original two images from
+the mixed input, which significantly improves efficiency. While MixMIM can
+be applied to various architectures, this paper explores a simpler but stronger
+hierarchical Transformer, and scales with MixMIM-B, -L, and -H. Empirical
+results demonstrate that MixMIM can learn high-quality visual representations
+efficiently. Notably, MixMIM-B with 88M parameters achieves 85.1% top-1
+accuracy on ImageNet-1K by pretraining for 600 epochs, setting a new record for
+neural networks with comparable model sizes (e.g., ViT-B) among MIM methods.
+Besides, its transferring performances on the other 6 datasets show MixMIM has
+better FLOPs / performance tradeoff than previous MIM methods.
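+
+The core mixing step described above can be illustrated with a short, self-contained sketch (an illustration of the idea only, not the MixMIM backbone implementation added in this diff; the helper name, shapes and masking ratio are assumptions):
+
+```python
+import torch
+
+
+def mix_image_tokens(tokens_a, tokens_b, mask):
+    """Fill the masked positions of image A with the visible tokens of image B.
+
+    tokens_a, tokens_b: (B, L, C) patch tokens of two different images.
+    mask: (B, L) boolean tensor, True where image A's tokens are masked out.
+    The mixed sequence is encoded once; dual reconstruction then recovers
+    image A at the masked positions and image B at the remaining positions.
+    """
+    return torch.where(mask.unsqueeze(-1), tokens_b, tokens_a)
+
+
+# Toy usage: 2 samples, 49 patches, 8-dim embeddings, 50% masking ratio.
+a, b = torch.randn(2, 49, 8), torch.randn(2, 49, 8)
+mask = torch.rand(2, 49) < 0.5
+mixed = mix_image_tokens(a, b, mask)
+print(mixed.shape)  # torch.Size([2, 49, 8])
+```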
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mixmim_mixmim-base_16xb128-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------: | :--------------------------------------------------------------------------------: |
+| `mixmim_mixmim-base_16xb128-coslr-300e_in1k` | 114.67 | 16.35 | [config](mixmim_mixmim-base_16xb128-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k` | [MIXMIM](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.pth) | 88.34 | 16.35 | 84.63 | [config](benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.json) |
+
+## Citation
+
+```bibtex
+@article{MixMIM2022,
+ author = {Jihao Liu and Xin Huang and Yu Liu and Hongsheng Li},
+ journal = {arXiv:2205.13137},
+ title = {MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning},
+ year = {2022},
+}
+```
diff --git a/configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py b/configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c48ee3b8b64e96490e4e9ceaaab5b2b5b1f3f3cc
--- /dev/null
+++ b/configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,133 @@
+_base_ = [
+ '../../_base_/models/mixmim/mixmim_base.py',
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+
+data_preprocessor = dict(
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=16,
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=8,
+ pin_memory=True,
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/val.txt',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+test_dataloader = val_dataloader
+
+model = dict(
+ backbone=dict(
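+ # NOTE: `checkpoint` ships empty; point it at the pretrained MixMIM weights
+ # from the README (or override it via --cfg-options) before fine-tuning.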
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=5e-4 * (8 * 128 / 256),
+ betas=(0.9, 0.999),
+ weight_decay=0.05),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.7,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0), # do not decay on ln and bias
+ '.bias': dict(decay_mult=0.0)
+ }))
+
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ eta_min=1e-6,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=10)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=1))
diff --git a/configs/mixmim/benchmarks/mixmim-base_8xb64_in1k.py b/configs/mixmim/benchmarks/mixmim-base_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..86ada85f4ef1e7934e44b4f044ff9d9adf88f782
--- /dev/null
+++ b/configs/mixmim/benchmarks/mixmim-base_8xb64_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../../_base_/models/mixmim/mixmim_base.py',
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs256.py',
+ '../../_base_/default_runtime.py'
+]
diff --git a/configs/mixmim/metafile.yml b/configs/mixmim/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..5bf87bda937f5091629c89143fd997cad0deb132
--- /dev/null
+++ b/configs/mixmim/metafile.yml
@@ -0,0 +1,51 @@
+Collections:
+ - Name: MixMIM
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ Title: 'MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation
+ Learning'
+ URL: https://arxiv.org/abs/2205.13137
+ README: configs/mixmim/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/mixmim.py
+ Version: v1.0.0rc4
+
+Models:
+ - Name: mixmim_mixmim-base_16xb128-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 2048
+ FLOPs: 16351906816
+ Parameters: 114665784
+ Training Data: ImageNet-1k
+ In Collection: MixMIM
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.pth
+ Config: configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py
+ Downstream:
+ - mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k
+ - Name: mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 16351906816
+ Parameters: 88344352
+ Training Data: ImageNet-1k
+ In Collection: MixMIM
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.63
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.pth
+ Config: configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py
diff --git a/configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py b/configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..29b94eaea311767a7fe91c47753680e5af6d0181
--- /dev/null
+++ b/configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py
@@ -0,0 +1,98 @@
+_base_ = '../_base_/default_runtime.py'
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.2, 1.0),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
+
+# model settings
+model = dict(
+ type='MixMIM',
+ backbone=dict(
+ type='MixMIMPretrainTransformer',
+ arch='B',
+ drop_rate=0.0,
+ drop_path_rate=0.0, # drop_path_rate=0.0 during pretraining
+ mask_ratio=0.5),
+ neck=dict(
+ type='MixMIMPretrainDecoder',
+ num_patches=49,
+ encoder_stride=32,
+ embed_dim=1024,
+ decoder_embed_dim=512,
+ decoder_depth=8,
+ decoder_num_heads=16),
+ head=dict(
+ type='MixMIMPretrainHead',
+ norm_pix=True,
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * (2048 / 256),
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0)
+ }))
+
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=1))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/mlp_mixer/README.md b/configs/mlp_mixer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f0bb4ce0984627f9dafe2f86910348cc20a8a0a7
--- /dev/null
+++ b/configs/mlp_mixer/README.md
@@ -0,0 +1,78 @@
+# MLP-Mixer
+
+> [MLP-Mixer: An all-MLP Architecture for Vision](https://arxiv.org/abs/2105.01601)
+
+
+
+## Abstract
+
+Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
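+
+The two layer types can be summarised in a compact sketch (a simplified illustration of the idea only, not the backbone implementation referenced by this config; the class name and hidden sizes are assumptions):
+
+```python
+import torch
+import torch.nn as nn
+
+
+class MixerBlock(nn.Module):
+    """One Mixer block: a token-mixing MLP across patches, then a channel-mixing MLP per patch."""
+
+    def __init__(self, num_patches, channels, tokens_hidden, channels_hidden):
+        super().__init__()
+        self.norm1 = nn.LayerNorm(channels)
+        # Mixes information *across* patches (applied along the patch dimension).
+        self.token_mlp = nn.Sequential(
+            nn.Linear(num_patches, tokens_hidden), nn.GELU(),
+            nn.Linear(tokens_hidden, num_patches))
+        self.norm2 = nn.LayerNorm(channels)
+        # Mixes information *within* each patch (applied along the channel dimension).
+        self.channel_mlp = nn.Sequential(
+            nn.Linear(channels, channels_hidden), nn.GELU(),
+            nn.Linear(channels_hidden, channels))
+
+    def forward(self, x):  # x: (B, num_patches, channels)
+        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
+        x = x + self.channel_mlp(self.norm2(x))
+        return x
+
+
+block = MixerBlock(num_patches=196, channels=768, tokens_hidden=384, channels_hidden=3072)
+print(block(torch.randn(1, 196, 768)).shape)  # torch.Size([1, 196, 768])
+```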
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mlp-mixer-base-p16_3rdparty_64xb64_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mlp-mixer-base-p16_3rdparty_64xb64_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-base-p16_3rdparty_64xb64_in1k_20211124-1377e3e0.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------: | :-------------------------------------------------------------: |
+| `mlp-mixer-base-p16_3rdparty_64xb64_in1k`\* | From scratch | 59.88 | 12.61 | 76.68 | 92.25 | [config](mlp-mixer-base-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-base-p16_3rdparty_64xb64_in1k_20211124-1377e3e0.pth) |
+| `mlp-mixer-large-p16_3rdparty_64xb64_in1k`\* | From scratch | 208.20 | 44.57 | 72.34 | 88.02 | [config](mlp-mixer-large-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-large-p16_3rdparty_64xb64_in1k_20211124-5a2519d2.pth) |
+
+*Models with * are converted from [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@misc{tolstikhin2021mlpmixer,
+ title={MLP-Mixer: An all-MLP Architecture for Vision},
+ author={Ilya Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Andreas Steiner and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
+ year={2021},
+ eprint={2105.01601},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/mlp_mixer/metafile.yml b/configs/mlp_mixer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..8b632db100373b10ad7653ed9e0302fa37013ee4
--- /dev/null
+++ b/configs/mlp_mixer/metafile.yml
@@ -0,0 +1,50 @@
+Collections:
+ - Name: MLP-Mixer
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - MLP
+ - Layer Normalization
+ - Dropout
+ Paper:
+ URL: https://arxiv.org/abs/2105.01601
+ Title: "MLP-Mixer: An all-MLP Architecture for Vision"
+ README: configs/mlp_mixer/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.18.0/mmcls/models/backbones/mlp_mixer.py
+ Version: v0.18.0
+
+Models:
+ - Name: mlp-mixer-base-p16_3rdparty_64xb64_in1k
+ In Collection: MLP-Mixer
+ Config: configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py
+ Metadata:
+ FLOPs: 12610000000 # 12.61 G
+ Parameters: 59880000 # 59.88 M
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.68
+ Top 5 Accuracy: 92.25
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-base-p16_3rdparty_64xb64_in1k_20211124-1377e3e0.pth
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_mixer_b16_224-76587d61.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py#L70
+
+ - Name: mlp-mixer-large-p16_3rdparty_64xb64_in1k
+ In Collection: MLP-Mixer
+ Config: configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py
+ Metadata:
+ FLOPs: 44570000000 # 44.57 G
+ Parameters: 208200000 # 208.2 M
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 72.34
+ Top 5 Accuracy: 88.02
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-large-p16_3rdparty_64xb64_in1k_20211124-5a2519d2.pth
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_mixer_b16_224_in21k-617b3de2.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py#L73
diff --git a/configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py b/configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..bbf4268d3c6121be57d48e8577f3edebde05114b
--- /dev/null
+++ b/configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/mlp_mixer_base_patch16.py',
+ '../_base_/datasets/imagenet_bs64_mixer_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py',
+]
+
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py b/configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4fbe9c5c9ebc70ee1b718e904af1bc49fb6d3c78
--- /dev/null
+++ b/configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/mlp_mixer_large_patch16.py',
+ '../_base_/datasets/imagenet_bs64_mixer_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py',
+]
+
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/mobilenet_v2/README.md b/configs/mobilenet_v2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..74548e19698ead42fd7cfb86f8a7c04fbee7f022
--- /dev/null
+++ b/configs/mobilenet_v2/README.md
@@ -0,0 +1,97 @@
+# MobileNet V2
+
+> [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381)
+
+
+
+## Introduction
+
+**MobileNet V2** is initially described in [the paper](https://arxiv.org/pdf/1801.04381.pdf), which improves the state-of-the-art performance of mobile models on multiple tasks. MobileNetV2 is an improvement on V1: its key new ideas are the linear bottleneck and the inverted residual, a structure in which the input and output of the residual block are thin bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. The authors of MobileNet V2 measure its performance on ImageNet classification, COCO object detection, and VOC image segmentation.
+
+
+

+
+
+## Abstract
+
+In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3.
+
+The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers opposite to traditional residual models which use expanded representations in the input. MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet classification, COCO object detection, VOC image segmentation. We evaluate the trade-offs between accuracy, and number of operations measured by multiply-adds (MAdd), as well as the number of parameters.
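+
+The inverted residual block described above can be sketched as follows (a minimal illustration only, not the MobileNetV2 backbone implementation referenced by this config; layer widths and the class name are assumptions):
+
+```python
+import torch
+import torch.nn as nn
+
+
+class InvertedResidual(nn.Module):
+    """Expand (1x1) -> depthwise (3x3) -> linear bottleneck (1x1), with a
+    residual connection when the block keeps resolution and width."""
+
+    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
+        super().__init__()
+        hidden = in_ch * expand_ratio
+        self.use_residual = stride == 1 and in_ch == out_ch
+        self.layers = nn.Sequential(
+            # 1x1 pointwise expansion to a wider representation.
+            nn.Conv2d(in_ch, hidden, 1, bias=False),
+            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
+            # Lightweight 3x3 depthwise convolution in the expanded space.
+            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
+            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
+            # 1x1 linear bottleneck projection: no non-linearity here.
+            nn.Conv2d(hidden, out_ch, 1, bias=False),
+            nn.BatchNorm2d(out_ch))
+
+    def forward(self, x):
+        out = self.layers(x)
+        return x + out if self.use_residual else out
+
+
+block = InvertedResidual(32, 32)
+print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 32, 56, 56])
+```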
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mobilenet-v2_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mobilenet-v2_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: |
+| `mobilenet-v2_8xb32_in1k` | From scratch | 3.50 | 0.32 | 71.86 | 90.42 | [config](mobilenet-v2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.json) |
+
+## Citation
+
+```bibtex
+@INPROCEEDINGS{8578572,
+ author={M. {Sandler} and A. {Howard} and M. {Zhu} and A. {Zhmoginov} and L. {Chen}},
+ booktitle={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ title={MobileNetV2: Inverted Residuals and Linear Bottlenecks},
+ year={2018},
+ volume={},
+ number={},
+ pages={4510-4520},
+ doi={10.1109/CVPR.2018.00474}
+}
+```
diff --git a/configs/mobilenet_v2/metafile.yml b/configs/mobilenet_v2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..aaa490ae485e87c3965f946f3fe25aa52919830b
--- /dev/null
+++ b/configs/mobilenet_v2/metafile.yml
@@ -0,0 +1,34 @@
+Collections:
+ - Name: MobileNet V2
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 300
+ Batch Size: 256
+ Architecture:
+ - MobileNet V2
+ Paper:
+ URL: https://arxiv.org/abs/1801.04381
+ Title: "MobileNetV2: Inverted Residuals and Linear Bottlenecks"
+ README: configs/mobilenet_v2/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/mobilenet_v2.py#L101
+ Version: v0.15.0
+
+Models:
+ - Name: mobilenet-v2_8xb32_in1k
+ Metadata:
+ FLOPs: 319000000
+ Parameters: 3500000
+ In Collection: MobileNet V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 71.86
+ Top 5 Accuracy: 90.42
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth
+ Config: configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py
diff --git a/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py b/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..afd2d9795af601010833ba239465c3e2c5abdf20
--- /dev/null
+++ b/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/mobilenet_v2_1x.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_epochstep.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/mobilenet_v3/README.md b/configs/mobilenet_v3/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..833de5b25aae9a8af43f5e086e6e2fd212669d03
--- /dev/null
+++ b/configs/mobilenet_v3/README.md
@@ -0,0 +1,99 @@
+# MobileNet V3
+
+> [Searching for MobileNetV3](https://arxiv.org/abs/1905.02244)
+
+
+
+## Introduction
+
+**MobileNet V3** is initially described in [the paper](https://arxiv.org/pdf/1905.02244.pdf). Its architecture is obtained by network architecture search (NAS), inheriting the practical building blocks of V1 and V2 and adding the squeeze-and-excitation (SE) channel attention mechanism. The authors create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small, which are targeted at high- and low-resource use cases respectively. These models are then adapted and applied to the tasks of object detection and semantic segmentation. The authors of MobileNet V3 measure its performance on ImageNet classification, COCO object detection, and Cityscapes segmentation.
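+
+The squeeze-and-excitation (SE) channel attention mentioned above can be sketched in a few lines (an illustration only, not the MobileNet V3 backbone implementation referenced by this config; the reduction ratio and gating activation are assumptions):
+
+```python
+import torch
+import torch.nn as nn
+
+
+class SqueezeExcite(nn.Module):
+    """Squeeze spatial information into per-channel statistics, then re-weight channels."""
+
+    def __init__(self, channels, reduction=4):
+        super().__init__()
+        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
+        self.fc = nn.Sequential(
+            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
+            nn.Linear(channels // reduction, channels), nn.Hardsigmoid())
+
+    def forward(self, x):  # x: (B, C, H, W)
+        b, c, _, _ = x.shape
+        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
+        return x * weights  # excite: per-channel re-weighting
+
+
+se = SqueezeExcite(channels=64)
+print(se(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 64, 28, 28])
+```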
+
+
+

+
+
+## Abstract
+
+We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 15% compared to MobileNetV2. MobileNetV3-Small is 4.6% more accurate while reducing latency by 5% compared to MobileNetV2. MobileNetV3-Large detection is 25% faster at roughly the same accuracy as MobileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 30% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mobilenet-v3-small-050_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mobilenet-v3-small-050_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-050_3rdparty_in1k_20221114-e0b86be1.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :--------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------------: | :--------------------------------------------------------------: |
+| `mobilenet-v3-small-050_3rdparty_in1k`\* | From scratch | 1.59 | 0.02 | 57.91 | 80.19 | [config](mobilenet-v3-small-050_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-050_3rdparty_in1k_20221114-e0b86be1.pth) |
+| `mobilenet-v3-small-075_3rdparty_in1k`\* | From scratch | 2.04 | 0.04 | 65.23 | 85.44 | [config](mobilenet-v3-small-075_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-075_3rdparty_in1k_20221114-2011fa76.pth) |
+| `mobilenet-v3-small_8xb128_in1k` | From scratch | 2.54 | 0.06 | 66.68 | 86.74 | [config](mobilenet-v3-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small_8xb128_in1k_20221114-bd1bfcde.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small_8xb128_in1k_20221114-bd1bfcde.json) |
+| `mobilenet-v3-small_3rdparty_in1k`\* | From scratch | 2.54 | 0.06 | 67.66 | 87.41 | [config](mobilenet-v3-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_small-8427ecf0.pth) |
+| `mobilenet-v3-large_8xb128_in1k` | From scratch | 5.48 | 0.23 | 73.49 | 91.31 | [config](mobilenet-v3-large_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-large_8xb128_in1k_20221114-0ed9ed9a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-large_8xb128_in1k_20221114-0ed9ed9a.json) |
+| `mobilenet-v3-large_3rdparty_in1k`\* | From scratch | 5.48 | 0.23 | 74.04 | 91.34 | [config](mobilenet-v3-large_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/pytorch/vision/blob/main/torchvision/models/mobilenetv3.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{Howard_2019_ICCV,
+ author = {Howard, Andrew and Sandler, Mark and Chu, Grace and Chen, Liang-Chieh and Chen, Bo and Tan, Mingxing and Wang, Weijun and Zhu, Yukun and Pang, Ruoming and Vasudevan, Vijay and Le, Quoc V. and Adam, Hartwig},
+ title = {Searching for MobileNetV3},
+ booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
+ month = {October},
+ year = {2019}
+}
+```
diff --git a/configs/mobilenet_v3/metafile.yml b/configs/mobilenet_v3/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..53f1653682fa2af2155b786ee5a8f0be9c98698e
--- /dev/null
+++ b/configs/mobilenet_v3/metafile.yml
@@ -0,0 +1,111 @@
+Collections:
+ - Name: MobileNet V3
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - RMSprop with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 600
+ Batch Size: 1024
+ Architecture:
+ - MobileNet V3
+ Paper:
+ URL: https://arxiv.org/abs/1905.02244
+ Title: Searching for MobileNetV3
+ README: configs/mobilenet_v3/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/mobilenet_v3.py
+ Version: v0.15.0
+
+Models:
+ - Name: mobilenet-v3-small-050_3rdparty_in1k
+ Metadata:
+ FLOPs: 24895000
+ Parameters: 1590000
+ In Collection: MobileNet V3
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 57.91
+ Top 5 Accuracy: 80.19
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-050_3rdparty_in1k_20221114-e0b86be1.pth
+ Config: configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/mobilenetv3_small_050_lambc-4b7bbe87.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/mobilenetv3.py
+ - Name: mobilenet-v3-small-075_3rdparty_in1k
+ Metadata:
+ FLOPs: 44791000
+ Parameters: 2040000
+ In Collection: MobileNet V3
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 65.23
+ Top 5 Accuracy: 85.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-075_3rdparty_in1k_20221114-2011fa76.pth
+ Config: configs/mobilenet_v3/mobilenet-v3-small-075_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/mobilenetv3_small_075_lambc-384766db.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/mobilenetv3.py
+ - Name: mobilenet-v3-small_8xb128_in1k
+ Metadata:
+ FLOPs: 60000000
+ Parameters: 2540000
+ In Collection: MobileNet V3
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 66.68
+ Top 5 Accuracy: 86.74
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small_8xb128_in1k_20221114-bd1bfcde.pth
+ Config: configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py
+ - Name: mobilenet-v3-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 60000000
+ Parameters: 2540000
+ In Collection: MobileNet V3
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 67.66
+ Top 5 Accuracy: 87.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_small-8427ecf0.pth
+ Config: configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/mobilenet_v3_small-047dcff4.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/mobilenetv3.py
+ - Name: mobilenet-v3-large_8xb128_in1k
+ Metadata:
+ FLOPs: 230000000
+ Parameters: 5480000
+ In Collection: MobileNet V3
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 73.49
+ Top 5 Accuracy: 91.31
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-large_8xb128_in1k_20221114-0ed9ed9a.pth
+ Config: configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py
+ - Name: mobilenet-v3-large_3rdparty_in1k
+ Metadata:
+ FLOPs: 230000000
+ Parameters: 5480000
+ In Collection: MobileNet V3
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.04
+ Top 5 Accuracy: 91.34
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth
+ Config: configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/mobilenet_v3_large-8738ca79.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/mobilenetv3.py
diff --git a/configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py b/configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f5c05baf39f1cffdb9610d41b1603119a2edc727
--- /dev/null
+++ b/configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py
@@ -0,0 +1,28 @@
+# Refers to https://pytorch.org/blog/ml-models-torchvision-v0.9/#classification
+
+_base_ = [
+ '../_base_/models/mobilenet_v3/mobilenet_v3_large_imagenet.py',
+ '../_base_/datasets/imagenet_bs128_mbv3.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='RMSprop',
+ lr=0.064,
+ alpha=0.9,
+ momentum=0.9,
+ eps=0.0316,
+ weight_decay=1e-5))
+
+param_scheduler = dict(type='StepLR', by_epoch=True, step_size=2, gamma=0.973)
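+# With gamma=0.973 applied every 2 epochs, the learning rate decays to about
+# 0.973**300 ~= 2.7e-4 of its initial value over the 600 training epochs.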
+
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py b/configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fc145625ca22f44ff48a6f4684589ab6833313e3
--- /dev/null
+++ b/configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py
@@ -0,0 +1,70 @@
+_base_ = [
+ '../_base_/models/mobilenet_v3/mobilenet_v3_small_050_imagenet.py',
+ '../_base_/datasets/imagenet_bs128_mbv3.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(backbone=dict(norm_cfg=dict(type='BN', eps=1e-5, momentum=0.1)))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='AutoAugment',
+ policies='imagenet',
+ hparams=dict(pad_val=[round(x) for x in [103.53, 116.28, 123.675]])),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.2,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+# If you want the standard test setting, please manually configure the test dataset
+test_dataloader = val_dataloader
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='RMSprop',
+ lr=0.064,
+ alpha=0.9,
+ momentum=0.9,
+ eps=0.0316,
+ weight_decay=1e-5))
+
+param_scheduler = dict(type='StepLR', by_epoch=True, step_size=2, gamma=0.973)
+
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=10)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/mobilenet_v3/mobilenet-v3-small-075_8xb128_in1k.py b/configs/mobilenet_v3/mobilenet-v3-small-075_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..464b7cbd60e8b741f9765df091bfdadbfe1712a3
--- /dev/null
+++ b/configs/mobilenet_v3/mobilenet-v3-small-075_8xb128_in1k.py
@@ -0,0 +1,68 @@
+_base_ = [
+ '../_base_/models/mobilenet_v3/mobilenet_v3_small_075_imagenet.py',
+ '../_base_/datasets/imagenet_bs128_mbv3.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(backbone=dict(norm_cfg=dict(type='BN', eps=1e-5, momentum=0.1)))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='AutoAugment',
+ policies='imagenet',
+ hparams=dict(pad_val=[round(x) for x in [103.53, 116.28, 123.675]])),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.2,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='RMSprop',
+ lr=0.064,
+ alpha=0.9,
+ momentum=0.9,
+ eps=0.0316,
+ weight_decay=1e-5))
+
+param_scheduler = dict(type='StepLR', by_epoch=True, step_size=2, gamma=0.973)
+
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=10)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py b/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..06b0a328106611ced7ede94c0439f3e39d12f306
--- /dev/null
+++ b/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py
@@ -0,0 +1,28 @@
+# Refers to https://pytorch.org/blog/ml-models-torchvision-v0.9/#classification
+
+_base_ = [
+ '../_base_/models/mobilenet_v3/mobilenet_v3_small_imagenet.py',
+ '../_base_/datasets/imagenet_bs128_mbv3.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='RMSprop',
+ lr=0.064,
+ alpha=0.9,
+ momentum=0.9,
+ eps=0.0316,
+ weight_decay=1e-5))
+
+param_scheduler = dict(type='StepLR', by_epoch=True, step_size=2, gamma=0.973)
+
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/mobilenet_v3/mobilenet-v3-small_8xb16_cifar10.py b/configs/mobilenet_v3/mobilenet-v3-small_8xb16_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..4cfaa2f629523ad66966d3e70c9ca084644e1f8d
--- /dev/null
+++ b/configs/mobilenet_v3/mobilenet-v3-small_8xb16_cifar10.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/models/mobilenet_v3/mobilenet_v3_small_cifar.py',
+ '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
+
+# schedule settings
+param_scheduler = dict(
+ type='MultiStepLR',
+ by_epoch=True,
+ milestones=[120, 170],
+ gamma=0.1,
+)
+
+train_cfg = dict(by_epoch=True, max_epochs=200)
diff --git a/configs/mobileone/README.md b/configs/mobileone/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e753aff9089fe30700f6db4313fd337f73f7d47d
--- /dev/null
+++ b/configs/mobileone/README.md
@@ -0,0 +1,98 @@
+# MobileOne
+
+> [An Improved One millisecond Mobile Backbone](https://arxiv.org/abs/2206.04040)
+
+
+
+## Introduction
+
+MobileOne is proposed by Apple and is based on re-parameterization. On Apple chips, the model reaches close to 76% top-1 accuracy on ImageNet at a latency of under 1 ms. Its main improvements over [RepVGG](../repvgg) are the following (a minimal branch-fusion sketch is given after the list):
+
+- Re-parameterization using depthwise and pointwise convolutions instead of normal convolutions.
+- Removal of the residual structure, which is not friendly to memory access.
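+
+The train-time blocks therefore contain several parallel conv + BN branches that are folded into single convolutions for inference. Below is a minimal, generic sketch of how one conv + BN branch can be folded; the helper name `fuse_conv_bn` and the shapes are illustrative, not the mmpretrain implementation. MobileOne additionally sums the folded kernels of all parallel branches into one depthwise or pointwise convolution at deploy time.
+
+```python
+import torch
+import torch.nn as nn
+
+
+def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
+    """Fold a BatchNorm layer into the preceding convolution (inference only)."""
+    fused = nn.Conv2d(
+        conv.in_channels, conv.out_channels, conv.kernel_size,
+        stride=conv.stride, padding=conv.padding, dilation=conv.dilation,
+        groups=conv.groups, bias=True)
+    # BN(x) = gamma * (x - mean) / sqrt(var + eps) + beta
+    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
+    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
+    bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
+    fused.bias.data = bn.bias.data + (bias - bn.running_mean) * scale
+    return fused
+```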
+
+
+

+
+
+## Abstract
+
+
+
+
+
+Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with latency of the network when deployed on a mobile device. Therefore, we perform extensive analysis of different metrics by deploying several mobile-friendly networks on a mobile device. We identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks. To this end, we design an efficient backbone MobileOne, with variants achieving an inference time under 1 ms on an iPhone12 with 75.9% top-1 accuracy on ImageNet. We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile. Our best model obtains similar performance on ImageNet as MobileFormer while being 38x faster. Our model obtains 2.3% better top-1 accuracy on ImageNet than EfficientNet at similar latency. Furthermore, we show that our model generalizes to multiple tasks - image classification, object detection, and semantic segmentation with significant improvements in latency and accuracy as compared to existing efficient architectures when deployed on a mobile device.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mobileone-s0_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mobileone-s0_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mobileone/mobileone-s0_8xb32_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mobileone/mobileone-s0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: |
+| `mobileone-s0_8xb32_in1k` | From scratch | 2.08 | 0.27 | 71.34 | 89.87 | [config](mobileone-s0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.json) |
+| `mobileone-s1_8xb32_in1k` | From scratch | 4.76 | 0.82 | 75.72 | 92.54 | [config](mobileone-s1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s1_8xb32_in1k_20221110-ceeef467.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s1_8xb32_in1k_20221110-ceeef467.json) |
+| `mobileone-s2_8xb32_in1k` | From scratch | 7.81 | 1.30 | 77.37 | 93.34 | [config](mobileone-s2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s2_8xb32_in1k_20221110-9c7ecb97.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s2_8xb32_in1k_20221110-9c7ecb97.json) |
+| `mobileone-s3_8xb32_in1k` | From scratch | 10.08 | 1.89 | 78.06 | 93.83 | [config](mobileone-s3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s3_8xb32_in1k_20221110-c95eb3bf.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s3_8xb32_in1k_20221110-c95eb3bf.json) |
+| `mobileone-s4_8xb32_in1k` | From scratch | 14.84 | 2.98 | 79.69 | 94.46 | [config](mobileone-s4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s4_8xb32_in1k_20221110-28d888cb.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s4_8xb32_in1k_20221110-28d888cb.json) |
+
+## Citation
+
+```bibtex
+@article{mobileone2022,
+ title={An Improved One millisecond Mobile Backbone},
+ author={Vasu, Pavan Kumar Anasosalu and Gabriel, James and Zhu, Jeff and Tuzel, Oncel and Ranjan, Anurag},
+ journal={arXiv preprint arXiv:2206.04040},
+ year={2022}
+}
+```
diff --git a/configs/mobileone/deploy/mobileone-s0_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s0_deploy_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..145f3f4ec90f643a056177a7d7c0b8fc370539cc
--- /dev/null
+++ b/configs/mobileone/deploy/mobileone-s0_deploy_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = ['../mobileone-s0_8xb32_in1k.py']
+
+model = dict(backbone=dict(deploy=True))
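+
+# `deploy=True` builds the backbone directly in its re-parameterized,
+# single-branch inference form; it therefore expects weights that have already
+# been fused from a training-time (multi-branch) checkpoint.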
diff --git a/configs/mobileone/deploy/mobileone-s1_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s1_deploy_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8602c31ce6c7c3115e3f45313b687816f0854ddb
--- /dev/null
+++ b/configs/mobileone/deploy/mobileone-s1_deploy_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = ['../mobileone-s1_8xb32_in1k.py']
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/mobileone/deploy/mobileone-s2_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s2_deploy_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..97aaddd0740b0a005ecab5b08d3459b0da6c474c
--- /dev/null
+++ b/configs/mobileone/deploy/mobileone-s2_deploy_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = ['../mobileone-s2_8xb32_in1k.py']
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/mobileone/deploy/mobileone-s3_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s3_deploy_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0d335a7ba9300f8d6d35a288dab02baf0adabdb2
--- /dev/null
+++ b/configs/mobileone/deploy/mobileone-s3_deploy_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = ['../mobileone-s3_8xb32_in1k.py']
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/mobileone/deploy/mobileone-s4_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s4_deploy_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b82f5a9ac7ecd6c5fc84369083c66d6dae0afd51
--- /dev/null
+++ b/configs/mobileone/deploy/mobileone-s4_deploy_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = ['../mobileone-s4_8xb32_in1k.py']
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/mobileone/metafile.yml b/configs/mobileone/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..70370da0e8d56baf8001ddaff1f78110462ad86a
--- /dev/null
+++ b/configs/mobileone/metafile.yml
@@ -0,0 +1,83 @@
+Collections:
+ - Name: MobileOne
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - re-parameterization Convolution
+ - VGG-style Neural Network
+ - Depthwise Convolution
+ - Pointwise Convolution
+ Paper:
+ URL: https://arxiv.org/abs/2206.04040
+ Title: 'An Improved One millisecond Mobile Backbone'
+ README: configs/mobileone/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc1/configs/mobileone/metafile.yml
+ Version: v1.0.0rc1
+
+Models:
+ - Name: mobileone-s0_8xb32_in1k
+ In Collection: MobileOne
+ Config: configs/mobileone/mobileone-s0_8xb32_in1k.py
+ Metadata:
+ FLOPs: 274136576 # 0.27G
+ Parameters: 2078504 # 2.08M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 71.34
+ Top 5 Accuracy: 89.87
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.pth
+ - Name: mobileone-s1_8xb32_in1k
+ In Collection: MobileOne
+ Config: configs/mobileone/mobileone-s1_8xb32_in1k.py
+ Metadata:
+ FLOPs: 823839744 # 8.6G
+ Parameters: 4764840 # 4.82M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 75.72
+ Top 5 Accuracy: 92.54
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s1_8xb32_in1k_20221110-ceeef467.pth
+ - Name: mobileone-s2_8xb32_in1k
+ In Collection: MobileOne
+ Config: configs/mobileone/mobileone-s2_8xb32_in1k.py
+ Metadata:
+ FLOPs: 1296478848
+ Parameters: 7808168
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 77.37
+ Top 5 Accuracy: 93.34
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s2_8xb32_in1k_20221110-9c7ecb97.pth
+ - Name: mobileone-s3_8xb32_in1k
+ In Collection: MobileOne
+ Config: configs/mobileone/mobileone-s3_8xb32_in1k.py
+ Metadata:
+ FLOPs: 1893842944
+ Parameters: 10078312
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 78.06
+ Top 5 Accuracy: 93.83
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s3_8xb32_in1k_20221110-c95eb3bf.pth
+ - Name: mobileone-s4_8xb32_in1k
+ In Collection: MobileOne
+ Config: configs/mobileone/mobileone-s4_8xb32_in1k.py
+ Metadata:
+ FLOPs: 2979222528
+ Parameters: 14838352
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 79.69
+ Top 5 Accuracy: 94.46
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s4_8xb32_in1k_20221110-28d888cb.pth
diff --git a/configs/mobileone/mobileone-s0_8xb32_in1k.py b/configs/mobileone/mobileone-s0_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..be56b86c3ce4afc3cc61995efa60830be98050e0
--- /dev/null
+++ b/configs/mobileone/mobileone-s0_8xb32_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../_base_/models/mobileone/mobileone_s0.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
+
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ momentum=5e-4,
+ priority='ABOVE_NORMAL',
+ update_buffers=True)
+]
diff --git a/configs/mobileone/mobileone-s1_8xb32_in1k.py b/configs/mobileone/mobileone-s1_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0bc3fb08922e0c87ad681e79c378d2b5404b696f
--- /dev/null
+++ b/configs/mobileone/mobileone-s1_8xb32_in1k.py
@@ -0,0 +1,60 @@
+_base_ = [
+ '../_base_/models/mobileone/mobileone_s1.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
+
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+bgr_mean = _base_.data_preprocessor['mean'][::-1]
+base_train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[round(x) for x in bgr_mean])),
+ dict(type='PackInputs')
+]
+
+import copy # noqa: E402
+
+# modify start epoch's RandomResizedCrop.scale to 160
+train_pipeline_1e = copy.deepcopy(base_train_pipeline)
+train_pipeline_1e[1]['scale'] = 160
+train_pipeline_1e[3]['magnitude_level'] *= 0.1
+_base_.train_dataloader.dataset.pipeline = train_pipeline_1e
+
+# modify 37 epoch's RandomResizedCrop.scale to 192
+train_pipeline_37e = copy.deepcopy(base_train_pipeline)
+train_pipeline_37e[1]['scale'] = 192
+train_pipeline_37e[3]['magnitude_level'] *= 0.2
+
+# modify 112 epoch's RandomResizedCrop.scale to 224
+train_pipeline_112e = copy.deepcopy(base_train_pipeline)
+train_pipeline_112e[1]['scale'] = 224
+train_pipeline_112e[3]['magnitude_level'] *= 0.3
+
+custom_hooks = [
+ dict(
+ type='SwitchRecipeHook',
+ schedule=[
+ dict(action_epoch=37, pipeline=train_pipeline_37e),
+ dict(action_epoch=112, pipeline=train_pipeline_112e),
+ ]),
+ dict(
+ type='EMAHook',
+ momentum=5e-4,
+ priority='ABOVE_NORMAL',
+ update_buffers=True)
+]
diff --git a/configs/mobileone/mobileone-s2_8xb32_in1k.py b/configs/mobileone/mobileone-s2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a7d4aae074952538d5d037b33438172f4c283613
--- /dev/null
+++ b/configs/mobileone/mobileone-s2_8xb32_in1k.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/models/mobileone/mobileone_s2.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
+
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+import copy # noqa: E402
+
+bgr_mean = _base_.data_preprocessor['mean'][::-1]
+base_train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[round(x) for x in bgr_mean])),
+ dict(type='PackInputs')
+]
+
+# modify start epoch RandomResizedCrop.scale to 160
+# and RA.magnitude_level * 0.3
+train_pipeline_1e = copy.deepcopy(base_train_pipeline)
+train_pipeline_1e[1]['scale'] = 160
+train_pipeline_1e[3]['magnitude_level'] *= 0.3
+_base_.train_dataloader.dataset.pipeline = train_pipeline_1e
+
+# modify 37 epoch's RandomResizedCrop.scale to 192
+# and RA.magnitude_level * 0.7
+train_pipeline_37e = copy.deepcopy(base_train_pipeline)
+train_pipeline_37e[1]['scale'] = 192
+train_pipeline_37e[3]['magnitude_level'] *= 0.7
+
+# modify 112 epoch's RandomResizedCrop.scale to 224
+# and RA.magnitude_level * 1.0
+train_pipeline_112e = copy.deepcopy(base_train_pipeline)
+train_pipeline_112e[1]['scale'] = 224
+train_pipeline_112e[3]['magnitude_level'] *= 1.0
+
+custom_hooks = [
+ dict(
+ type='SwitchRecipeHook',
+ schedule=[
+ dict(action_epoch=37, pipeline=train_pipeline_37e),
+ dict(action_epoch=112, pipeline=train_pipeline_112e),
+ ]),
+ dict(
+ type='EMAHook',
+ momentum=5e-4,
+ priority='ABOVE_NORMAL',
+ update_buffers=True)
+]
diff --git a/configs/mobileone/mobileone-s3_8xb32_in1k.py b/configs/mobileone/mobileone-s3_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2be0dc7e814c4e5a28369ae8888221f3e26ec657
--- /dev/null
+++ b/configs/mobileone/mobileone-s3_8xb32_in1k.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/models/mobileone/mobileone_s3.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
+
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+import copy # noqa: E402
+
+bgr_mean = _base_.data_preprocessor['mean'][::-1]
+base_train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[round(x) for x in bgr_mean])),
+ dict(type='PackInputs')
+]
+
+# modify start epoch RandomResizedCrop.scale to 160
+# and RA.magnitude_level * 0.3
+train_pipeline_1e = copy.deepcopy(base_train_pipeline)
+train_pipeline_1e[1]['scale'] = 160
+train_pipeline_1e[3]['magnitude_level'] *= 0.3
+_base_.train_dataloader.dataset.pipeline = train_pipeline_1e
+
+# modify 37 epoch's RandomResizedCrop.scale to 192
+# and RA.magnitude_level * 0.7
+train_pipeline_37e = copy.deepcopy(base_train_pipeline)
+train_pipeline_37e[1]['scale'] = 192
+train_pipeline_37e[3]['magnitude_level'] *= 0.7
+
+# modify 112 epoch's RandomResizedCrop.scale to 224
+# and RA.magnitude_level * 1.0
+train_pipeline_112e = copy.deepcopy(base_train_pipeline)
+train_pipeline_112e[1]['scale'] = 224
+train_pipeline_112e[3]['magnitude_level'] *= 1.0
+
+custom_hooks = [
+ dict(
+ type='SwitchRecipeHook',
+ schedule=[
+ dict(action_epoch=37, pipeline=train_pipeline_37e),
+ dict(action_epoch=112, pipeline=train_pipeline_112e),
+ ]),
+ dict(
+ type='EMAHook',
+ momentum=5e-4,
+ priority='ABOVE_NORMAL',
+ update_buffers=True)
+]
diff --git a/configs/mobileone/mobileone-s4_8xb32_in1k.py b/configs/mobileone/mobileone-s4_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..49356f05f9574f90192dc32d5b14c3b74a5cd459
--- /dev/null
+++ b/configs/mobileone/mobileone-s4_8xb32_in1k.py
@@ -0,0 +1,63 @@
+_base_ = [
+ '../_base_/models/mobileone/mobileone_s4.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
+
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+bgr_mean = _base_.data_preprocessor['mean'][::-1]
+base_train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[round(x) for x in bgr_mean])),
+ dict(type='PackInputs')
+]
+
+import copy # noqa: E402
+
+# modify start epoch RandomResizedCrop.scale to 160
+# and RA.magnitude_level * 0.3
+train_pipeline_1e = copy.deepcopy(base_train_pipeline)
+train_pipeline_1e[1]['scale'] = 160
+train_pipeline_1e[3]['magnitude_level'] *= 0.3
+_base_.train_dataloader.dataset.pipeline = train_pipeline_1e
+
+# modify 37 epoch's RandomResizedCrop.scale to 192
+# and RA.magnitude_level * 0.7
+train_pipeline_37e = copy.deepcopy(base_train_pipeline)
+train_pipeline_37e[1]['scale'] = 192
+train_pipeline_37e[3]['magnitude_level'] *= 0.7
+
+# modify 112 epoch's RandomResizedCrop.scale to 224
+# and RA.magnitude_level * 1.0
+train_pipeline_112e = copy.deepcopy(base_train_pipeline)
+train_pipeline_112e[1]['scale'] = 224
+train_pipeline_112e[3]['magnitude_level'] *= 1.0
+
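+# Progressive training recipe (see SwitchRecipeHook below): 160px crops with
+# reduced RandAugment strength until epoch 37, 192px crops until epoch 112,
+# and 224px crops with full-strength RandAugment afterwards; EMA weights are
+# maintained throughout training.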
+custom_hooks = [
+ dict(
+ type='SwitchRecipeHook',
+ schedule=[
+ dict(action_epoch=37, pipeline=train_pipeline_37e),
+ dict(action_epoch=112, pipeline=train_pipeline_112e),
+ ]),
+ dict(
+ type='EMAHook',
+ momentum=5e-4,
+ priority='ABOVE_NORMAL',
+ update_buffers=True)
+]
diff --git a/configs/mobilevit/README.md b/configs/mobilevit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..fa0960d123aed6eae6fee1155fd99d0955355280
--- /dev/null
+++ b/configs/mobilevit/README.md
@@ -0,0 +1,96 @@
+# MobileViT
+
+> [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178)
+
+
+
+## Introduction
+
+**MobileViT** introduces a light-weight network that combines the advantages of ViTs and CNNs: it uses the `InvertedResidual` blocks from [MobileNetV2](../mobilenet_v2/README.md) together with `MobileViTBlock`s, which adapt [ViT](../vision_transformer/README.md) transformer blocks, to build a standard 5-stage model structure.
+
+The `MobileViTBlock` treats transformers as convolutions to learn a global representation and combines it with ordinary convolution layers that provide a local representation, yielding a block with a global receptive field (a toy sketch of this block is given below). This differs from ViT, which adds an extra class token and position embeddings to learn relative relationships. Since it needs no position embeddings, MobileViT can benefit from multi-scale inputs during training.
+
+The paper also puts forward a multi-scale training strategy that dynamically adjusts the batch size according to the image size, improving both training efficiency and final performance.
+
+MobileViT is also shown to be effective on downstream tasks such as object detection and semantic segmentation.
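+
+To make the "transformers as convolutions" idea concrete, here is a highly simplified, self-contained sketch of a MobileViT-style block. The class name, dimensions, and the use of `nn.TransformerEncoderLayer` are illustrative assumptions, not the mmpretrain implementation: a local convolutional representation is unfolded into patches, pixels at the same within-patch position attend to each other across patches, and the result is folded back and fused with the input.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class TinyMobileViTBlock(nn.Module):
+    """Toy MobileViT-style block: local conv + global transformer + fusion."""
+
+    def __init__(self, channels=64, dim=96, patch=2, num_heads=4):
+        super().__init__()
+        self.patch = patch
+        self.local_rep = nn.Sequential(
+            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
+            nn.Conv2d(channels, dim, 1))
+        self.global_rep = nn.TransformerEncoderLayer(
+            d_model=dim, nhead=num_heads, dim_feedforward=2 * dim,
+            batch_first=True)
+        self.proj = nn.Conv2d(dim, channels, 1)
+        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)
+
+    def forward(self, x):
+        b, _, h, w = x.shape
+        p = self.patch
+        y = self.local_rep(x)  # local representation, (b, dim, h, w)
+        # unfold: pixels at the same within-patch position, gathered across
+        # all patches, form one attention sequence
+        y = y.reshape(b, -1, h // p, p, w // p, p)
+        y = y.permute(0, 3, 5, 2, 4, 1).reshape(b * p * p, (h // p) * (w // p), -1)
+        y = self.global_rep(y)  # "transformers as convolutions"
+        # fold back into a feature map
+        y = y.reshape(b, p, p, h // p, w // p, -1).permute(0, 5, 3, 1, 4, 2)
+        y = self.proj(y.reshape(b, -1, h, w))
+        return self.fuse(torch.cat([x, y], dim=1))  # fuse global with input
+
+
+block = TinyMobileViTBlock()
+print(block(torch.rand(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
+```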
+
+
+

+
+
+## Abstract
+
+
+
+
+
+
+Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mobilevit-small_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mobilevit-small_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/mobilevit/mobilevit-small_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-small_3rdparty_in1k_20221018-cb4f741c.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :------------------------------------------------------------------------: |
+| `mobilevit-small_3rdparty_in1k`\* | From scratch | 5.58 | 2.03 | 78.25 | 94.09 | [config](mobilevit-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-small_3rdparty_in1k_20221018-cb4f741c.pth) |
+| `mobilevit-xsmall_3rdparty_in1k`\* | From scratch | 2.32 | 1.05 | 74.75 | 92.32 | [config](mobilevit-xsmall_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xsmall_3rdparty_in1k_20221018-be39a6e7.pth) |
+| `mobilevit-xxsmall_3rdparty_in1k`\* | From scratch | 1.27 | 0.42 | 69.02 | 88.91 | [config](mobilevit-xxsmall_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xxsmall_3rdparty_in1k_20221018-77835605.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/apple/ml-cvnets). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{mehta2021mobilevit,
+ title={MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer},
+ author={Mehta, Sachin and Rastegari, Mohammad},
+ journal={arXiv preprint arXiv:2110.02178},
+ year={2021}
+}
+```
diff --git a/configs/mobilevit/metafile.yml b/configs/mobilevit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..15fd84ad54cacf0c7c0337b5139ba891d14c22f5
--- /dev/null
+++ b/configs/mobilevit/metafile.yml
@@ -0,0 +1,60 @@
+Collections:
+ - Name: MobileViT
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - MobileViT Block
+ Paper:
+ URL: https://arxiv.org/abs/2110.02178
+ Title: 'MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer'
+ README: configs/mobilevit/README.md
+
+Models:
+ - Name: mobilevit-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 2030000000
+ Parameters: 5580000
+ In Collection: MobileViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.25
+ Top 5 Accuracy: 94.09
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-small_3rdparty_in1k_20221018-cb4f741c.pth
+ Config: configs/mobilevit/mobilevit-small_8xb128_in1k.py
+ Converted From:
+ Weights: https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit_s.pt
+ Code: https://github.com/apple/ml-cvnets
+ - Name: mobilevit-xsmall_3rdparty_in1k
+ Metadata:
+ FLOPs: 1050000000
+ Parameters: 2320000
+ In Collection: MobileViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.75
+ Top 5 Accuracy: 92.32
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xsmall_3rdparty_in1k_20221018-be39a6e7.pth
+ Config: configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py
+ Converted From:
+ Weights: https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit_xs.pt
+ Code: https://github.com/apple/ml-cvnets
+ - Name: mobilevit-xxsmall_3rdparty_in1k
+ Metadata:
+ FLOPs: 420000000
+ Parameters: 1270000
+ In Collection: MobileViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.02
+ Top 5 Accuracy: 88.91
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xxsmall_3rdparty_in1k_20221018-77835605.pth
+ Config: configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py
+ Converted From:
+ Weights: https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit_xxs.pt
+ Code: https://github.com/apple/ml-cvnets
diff --git a/configs/mobilevit/mobilevit-small_8xb128_in1k.py b/configs/mobilevit/mobilevit-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..596939631c0520e67d480a37669704556719f2dc
--- /dev/null
+++ b/configs/mobilevit/mobilevit-small_8xb128_in1k.py
@@ -0,0 +1,30 @@
+_base_ = [
+ '../_base_/models/mobilevit/mobilevit_s.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/default_runtime.py',
+ '../_base_/schedules/imagenet_bs256.py',
+]
+
+# no mean/std normalization, to match the original implementation
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0, 0, 0],
+ std=[255, 255, 255],
+ # use bgr directly
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=288, edge='short'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=128)
+
+val_dataloader = dict(
+ batch_size=128,
+ dataset=dict(pipeline=test_pipeline),
+)
+test_dataloader = val_dataloader
diff --git a/configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py b/configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..557892bcc4911912d7e5d585cb0d27235cf08cd5
--- /dev/null
+++ b/configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py
@@ -0,0 +1,30 @@
+_base_ = [
+ '../_base_/models/mobilevit/mobilevit_xs.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/default_runtime.py',
+ '../_base_/schedules/imagenet_bs256.py',
+]
+
+# no mean/std normalization, to match the original implementation
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0, 0, 0],
+ std=[255, 255, 255],
+ # use bgr directly
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=288, edge='short'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=128)
+
+val_dataloader = dict(
+ batch_size=128,
+ dataset=dict(pipeline=test_pipeline),
+)
+test_dataloader = val_dataloader
diff --git a/configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py b/configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..74aea82f32bd65fd71962c588384e4a1e6ab43ea
--- /dev/null
+++ b/configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py
@@ -0,0 +1,30 @@
+_base_ = [
+ '../_base_/models/mobilevit/mobilevit_xxs.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/default_runtime.py',
+ '../_base_/schedules/imagenet_bs256.py',
+]
+
+# no mean/std normalization, to match the original implementation
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0, 0, 0],
+ std=[255, 255, 255],
+ # use bgr directly
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=288, edge='short'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=128)
+
+val_dataloader = dict(
+ batch_size=128,
+ dataset=dict(pipeline=test_pipeline),
+)
+test_dataloader = val_dataloader
diff --git a/configs/mocov2/README.md b/configs/mocov2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..cb0ae4ee7468f3294b28157eafb32cb04b63814d
--- /dev/null
+++ b/configs/mocov2/README.md
@@ -0,0 +1,85 @@
+# MoCoV2
+
+> [Improved Baselines with Momentum Contrastive Learning](https://arxiv.org/abs/2003.04297)
+
+
+
+## Abstract
+
+Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR. In this note, we verify the effectiveness of two of SimCLR’s design improvements by implementing them in the MoCo framework. With simple modifications to MoCo—namely, using an MLP projection head and more data augmentation—we establish stronger baselines that outperform SimCLR and do not require large training batches. We hope this will make state-of-the-art unsupervised learning research more accessible.
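+
+Both improvements plug into the MoCo framework, whose training objective is the InfoNCE contrastive loss over a queue of negative keys (cf. the `ContrastiveHead` with `temperature=0.2` in `mocov2_resnet50_8xb32-coslr-200e_in1k.py`). A minimal, illustrative sketch of that loss, with hypothetical tensor shapes:
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def moco_infonce_loss(q, k, queue, temperature=0.2):
+    """InfoNCE: one positive key vs. a queue of negatives (illustrative shapes)."""
+    q = F.normalize(q, dim=1)                 # queries, (N, C)
+    k = F.normalize(k, dim=1)                 # keys from the momentum encoder, (N, C)
+    l_pos = (q * k).sum(dim=1, keepdim=True)  # positive logits, (N, 1)
+    l_neg = q @ queue                         # negative logits, (N, K); queue of past keys, (C, K)
+    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
+    labels = torch.zeros(q.size(0), dtype=torch.long)  # the positive is index 0
+    return F.cross_entropy(logits, labels)
+```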
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mocov2_resnet50_8xb32-coslr-200e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :-------------------------------------- | :--------: | :-------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: |
+| `mocov2_resnet50_8xb32-coslr-200e_in1k` | 55.93 | 4.11 | [config](mocov2_resnet50_8xb32-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k` | [MOCOV2](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.pth) | 25.56 | 4.11 | 67.50 | [config](benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.json) |
+
+## Citation
+
+```bibtex
+@article{chen2020improved,
+ title={Improved baselines with momentum contrastive learning},
+ author={Chen, Xinlei and Fan, Haoqi and Girshick, Ross and He, Kaiming},
+ journal={arXiv preprint arXiv:2003.04297},
+ year={2020}
+}
+```
diff --git a/configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py b/configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..37795d9c866c5f9b26b0e016959a01677b8a216e
--- /dev/null
+++ b/configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_sgd_steplr_100e.py',
+ '../../_base_/default_runtime.py',
+]
+
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
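+
+# Linear evaluation protocol: the ResNet-50 backbone is frozen (frozen_stages=4)
+# and loaded from a self-supervised checkpoint (fill in `checkpoint` above);
+# only the linear head is trained, hence the unusually large LR of 30 below.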
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=30., momentum=0.9, weight_decay=0.))
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/mocov2/metafile.yml b/configs/mocov2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..4440db45b5a1a6ab8352c589471cbd4b6d6bb786
--- /dev/null
+++ b/configs/mocov2/metafile.yml
@@ -0,0 +1,45 @@
+Collections:
+ - Name: MoCoV2
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Architecture:
+ - ResNet
+ - MoCo
+ Paper:
+ Title: Improved Baselines with Momentum Contrastive Learning
+ URL: https://arxiv.org/abs/2003.04297
+ README: configs/mocov2/README.md
+
+Models:
+ - Name: mocov2_resnet50_8xb32-coslr-200e_in1k
+ Metadata:
+ Epochs: 200
+ Batch Size: 256
+ FLOPs: 4109364224
+ Parameters: 55933312
+ Training Data: ImageNet-1k
+ In Collection: MoCoV2
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.pth
+ Config: configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py
+ Downstream:
+ - resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k
+ - Name: resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 256
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: MoCoV2
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 67.5
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.pth
+ Config: configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
diff --git a/configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py b/configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8037d075a2e5a8490dc4c3709f274784a6f3f4f0
--- /dev/null
+++ b/configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_mocov2.py',
+ '../_base_/schedules/imagenet_sgd_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='MoCo',
+ queue_len=65536,
+ feat_dim=128,
+ momentum=0.001,
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='BN'),
+ zero_init_residual=False),
+ neck=dict(
+ type='MoCoV2Neck',
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=128,
+ with_avg_pool=True),
+ head=dict(
+ type='ContrastiveHead',
+ loss=dict(type='CrossEntropyLoss'),
+ temperature=0.2))
+
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/mocov3/README.md b/configs/mocov3/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a9477e8a6da037a4e773bcb693b0f449f8e8fda7
--- /dev/null
+++ b/configs/mocov3/README.md
@@ -0,0 +1,96 @@
+# MoCoV3
+
+> [An Empirical Study of Training Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.02057)
+
+
+
+## Abstract
+
+This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failure, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_mocov3-100e-pre_8xb128-linear-coslr-90e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mocov3_resnet50_8xb512-amp-coslr-100e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-8f7d937e.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------------: | :--------------------------------------------------------------------: |
+| `mocov3_resnet50_8xb512-amp-coslr-100e_in1k` | 68.01 | 4.11 | [config](mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.json) |
+| `mocov3_resnet50_8xb512-amp-coslr-300e_in1k` | 68.01 | 4.11 | [config](mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/mocov3_resnet50_8xb512-amp-coslr-300e_in1k_20220927-1e4f3304.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/mocov3_resnet50_8xb512-amp-coslr-300e_in1k_20220927-1e4f3304.json) |
+| `mocov3_resnet50_8xb512-amp-coslr-800e_in1k` | 68.01 | 4.11 | [config](mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20220927-e043f51a.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20220927-e043f51a.json) |
+| `mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k` | 84.27 | 4.61 | [config](mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k-224_20220826-08bc52f7.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k-224_20220826-08bc52f7.json) |
+| `mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k` | 215.68 | 17.58 | [config](mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.json) |
+| `mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k` | 652.78 | 61.60 | [config](mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k-224_20220829-9b88a442.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k-224_20220829-9b88a442.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_mocov3-100e-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3 100-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth) | 25.56 | 4.11 | 69.60 | [config](benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-8f7d937e.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-8f7d937e.json) |
+| `resnet50_mocov3-300e-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3 300-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/mocov3_resnet50_8xb512-amp-coslr-300e_in1k_20220927-1e4f3304.pth) | 25.56 | 4.11 | 72.80 | [config](benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-d21ddac2.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-d21ddac2.json) |
+| `resnet50_mocov3-800e-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20220927-e043f51a.pth) | 25.56 | 4.11 | 74.40 | [config](benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-0e97a483.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-0e97a483.json) |
+| `vit-small-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k-224_20220826-08bc52f7.pth) | 22.05 | 4.61 | 73.60 | [config](benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k_20220826-376674ef.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k_20220826-376674ef.json) |
+| `vit-base-p16_mocov3-pre_8xb64-coslr-150e_in1k` | [MOCOV3](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.pth) | 86.57 | 17.58 | 83.00 | [config](benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k_20220826-f1e6c442.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k_20220826-f1e6c442.json) |
+| `vit-base-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.pth) | 86.57 | 17.58 | 76.90 | [config](benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k_20220826-83be7758.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k_20220826-83be7758.json) |
+| `vit-large-p16_mocov3-pre_8xb64-coslr-100e_in1k` | [MOCOV3](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k-224_20220829-9b88a442.pth) | 304.33 | 61.60 | 83.70 | [config](benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k_20220829-878a2f7f.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k_20220829-878a2f7f.json) |
+
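+The benchmark configs above leave the `checkpoint` field of the backbone's `init_cfg` empty. The snippet below is a minimal sketch of filling it in programmatically with MMEngine's `Config`; the checkpoint URL is taken from the pretrained-model table above, and `merge_from_dict` is only one option, since the same dotted key can also be passed to `tools/train.py` through `--cfg-options`.
+
+```python
+from mmengine.config import Config
+
+# Load the ResNet-50 linear probing benchmark config.
+cfg = Config.fromfile(
+    'configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py')
+
+# Point the backbone initialization at the 100-epoch MoCoV3 checkpoint.
+cfg.merge_from_dict({
+    'model.backbone.init_cfg.checkpoint':
+        'https://download.openmmlab.com/mmselfsup/1.x/mocov3/'
+        'mocov3_resnet50_8xb512-amp-coslr-100e_in1k/'
+        'mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth',
+})
+
+print(cfg.model.backbone.init_cfg.checkpoint)
+```
+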
+## Citation
+
+```bibtex
+@InProceedings{Chen_2021_ICCV,
+ title = {An Empirical Study of Training Self-Supervised Vision Transformers},
+ author = {Chen, Xinlei and Xie, Saining and He, Kaiming},
+ booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
+ year = {2021}
+}
+```
diff --git a/configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py b/configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4d0b202b0f643c51e5d931cbf1ee59793aae03cb
--- /dev/null
+++ b/configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_sgd_coslr_100e.py',
+ '../../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
+
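+# The backbone is kept frozen for linear probing and is initialized from a
+# MoCoV3 pretrained checkpoint. The `checkpoint` field below is left empty and
+# is expected to be filled in when launching the benchmark, e.g. via
+# `--cfg-options model.backbone.init_cfg.checkpoint=...`.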
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ norm_eval=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=0.4, momentum=0.9, weight_decay=0.))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=90, by_epoch=True, begin=0, end=90)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=90)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py b/configs/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..91509fc05d6b6274a4bf5237d27d9e28ee365b9d
--- /dev/null
+++ b/configs/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py
@@ -0,0 +1,45 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MoCoV3ViT',
+ arch='base', # embed_dim = 768
+ img_size=224,
+ patch_size=16,
+ stop_grad_conv1=True,
+ frozen_stages=12,
+ norm_eval=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=dict(type='Normal', std=0.01, layer='Linear'),
+ ))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=12, momentum=0.9, weight_decay=0.))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=90, by_epoch=True, begin=0, end=90)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=90)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/mocov3/benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py b/configs/mocov3/benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f3d074f6ed93a4f5b108c441d00b12cb51802a62
--- /dev/null
+++ b/configs/mocov3/benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py
@@ -0,0 +1,74 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(
+ type='AdamW', lr=5e-4, eps=1e-8, betas=(0.9, 0.999),
+ weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=145,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=5,
+ end=150,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=150)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+randomness = dict(seed=0)
diff --git a/configs/mocov3/benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py b/configs/mocov3/benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..46d7f48299edfa39316eeb137c71d72d3a7955b7
--- /dev/null
+++ b/configs/mocov3/benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py
@@ -0,0 +1,74 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.5,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(
+ type='AdamW', lr=5e-4, eps=1e-8, betas=(0.9, 0.999),
+ weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+randomness = dict(seed=0)
diff --git a/configs/mocov3/benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py b/configs/mocov3/benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0c1ffa1972641194beff66d2e4ccfa31e5426fca
--- /dev/null
+++ b/configs/mocov3/benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py
@@ -0,0 +1,45 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MoCoV3ViT',
+ arch='mocov3-small', # embed_dim = 384
+ img_size=224,
+ patch_size=16,
+ stop_grad_conv1=True,
+ frozen_stages=12,
+ norm_eval=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=dict(type='Normal', std=0.01, layer='Linear'),
+ ))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=12, momentum=0.9, weight_decay=0.))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=90, by_epoch=True, begin=0, end=90)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=90)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/mocov3/metafile.yml b/configs/mocov3/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..649d9f439e65f18b7b1613a861113425cba480ae
--- /dev/null
+++ b/configs/mocov3/metafile.yml
@@ -0,0 +1,201 @@
+Collections:
+ - Name: MoCoV3
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - LARS
+ Training Resources: 32x V100 GPUs
+ Architecture:
+ - ResNet
+ - ViT
+ - MoCo
+ Paper:
+ Title: An Empirical Study of Training Self-Supervised Vision Transformers
+ URL: https://arxiv.org/abs/2104.02057
+ README: configs/mocov3/README.md
+
+Models:
+ - Name: mocov3_resnet50_8xb512-amp-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 4096
+ FLOPs: 4109364224
+ Parameters: 68012160
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth
+ Config: configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py
+ Downstream:
+ - resnet50_mocov3-100e-pre_8xb128-linear-coslr-90e_in1k
+ - Name: mocov3_resnet50_8xb512-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 4096
+ FLOPs: 4109364224
+ Parameters: 68012160
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/mocov3_resnet50_8xb512-amp-coslr-300e_in1k_20220927-1e4f3304.pth
+ Config: configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py
+ Downstream:
+ - resnet50_mocov3-300e-pre_8xb128-linear-coslr-90e_in1k
+ - Name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k
+ Metadata:
+ Epochs: 800
+ Batch Size: 4096
+ FLOPs: 4109364224
+ Parameters: 68012160
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20220927-e043f51a.pth
+ Config: configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py
+ Downstream:
+ - resnet50_mocov3-800e-pre_8xb128-linear-coslr-90e_in1k
+ - Name: resnet50_mocov3-100e-pre_8xb128-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 1024
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.6
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-8f7d937e.pth
+ Config: configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py
+ - Name: resnet50_mocov3-300e-pre_8xb128-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 1024
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 72.8
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-d21ddac2.pth
+ Config: configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py
+ - Name: resnet50_mocov3-800e-pre_8xb128-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 1024
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.4
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-0e97a483.pth
+ Config: configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py
+ - Name: mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 4096
+ FLOPs: 4607954304
+ Parameters: 84266752
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k-224_20220826-08bc52f7.pth
+ Config: configs/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py
+ Downstream:
+ - vit-small-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k
+ - Name: vit-small-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 1024
+ FLOPs: 4607954304
+ Parameters: 22050664
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 73.6
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k_20220826-376674ef.pth
+ Config: configs/mocov3/benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py
+ - Name: mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 215678464
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.pth
+ Config: configs/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py
+ Downstream:
+ - vit-base-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k
+ - vit-base-p16_mocov3-pre_8xb64-coslr-150e_in1k
+ - Name: vit-base-p16_mocov3-pre_8xb64-coslr-150e_in1k
+ Metadata:
+ Epochs: 150
+ Batch Size: 512
+ FLOPs: 17581972224
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.0
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k_20220826-f1e6c442.pth
+ Config: configs/mocov3/benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py
+ - Name: vit-base-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 1024
+ FLOPs: 17581972224
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.9
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k_20220826-83be7758.pth
+ Config: configs/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py
+ - Name: mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 4096
+ FLOPs: 61603111936
+ Parameters: 652781568
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k-224_20220829-9b88a442.pth
+ Config: configs/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py
+ Downstream:
+ - vit-large-p16_mocov3-pre_8xb64-coslr-100e_in1k
+ - Name: vit-large-p16_mocov3-pre_8xb64-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 512
+ FLOPs: 61603111936
+ Parameters: 304326632
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.7
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k_20220829-878a2f7f.pth
+ Config: configs/mocov3/benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py
diff --git a/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e4eabccad9017df0cb3838f423091365c30a7e12
--- /dev/null
+++ b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py
@@ -0,0 +1,82 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mocov3.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+temperature = 1.0
+model = dict(
+ type='MoCoV3',
+ base_momentum=0.01, # 0.01 for 100e and 300e, 0.004 for 1000e
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=False),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=True),
+ head=dict(
+ type='MoCoV3Head',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=False,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
+ temperature=temperature))
+
+# optimizer
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(type='LARS', lr=9.6, weight_decay=1e-6, momentum=0.9),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }),
+)
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=90,
+ by_epoch=True,
+ begin=10,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cc0e4141032b0f8cbe82af08b653db9849013a36
--- /dev/null
+++ b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py
@@ -0,0 +1,82 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mocov3.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+temperature = 1.0
+model = dict(
+ type='MoCoV3',
+ base_momentum=0.01, # 0.01 for 100e and 300e, 0.004 for 1000e
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=False),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=True),
+ head=dict(
+ type='MoCoV3Head',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=False,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
+ temperature=temperature))
+
+# optimizer
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(type='LARS', lr=4.8, weight_decay=1e-6, momentum=0.9),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }),
+)
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=290,
+ by_epoch=True,
+ begin=10,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..87f18e350ca2209fd2958a867ea6bf9887c695e5
--- /dev/null
+++ b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,82 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mocov3.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+temperature = 1.0
+model = dict(
+ type='MoCoV3',
+ base_momentum=0.004, # 0.01 for 100e and 300e, 0.004 for 800 and 1000e
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=False),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=True),
+ head=dict(
+ type='MoCoV3Head',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=False,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
+ temperature=temperature))
+
+# optimizer
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(type='LARS', lr=4.8, weight_decay=1.5e-6, momentum=0.9),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }),
+)
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=790,
+ by_epoch=True,
+ begin=10,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py b/configs/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..6b18fda74d646fbc6c85a0c95d70f52d91712142
--- /dev/null
+++ b/configs/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,151 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mocov3.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+# the difference between the ResNet50 and ViT pipelines is the crop ratio in
+# `RandomResizedCrop`: `crop_ratio_range=(0.08, 1.)` in the ViT pipeline
+view_pipeline1 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.08, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=1.),
+ dict(type='Solarize', thr=128, prob=0.),
+ dict(type='RandomFlip', prob=0.5),
+]
+view_pipeline2 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.08, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.1),
+ dict(type='Solarize', thr=128, prob=0.2),
+ dict(type='RandomFlip', prob=0.5),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='MultiView',
+ num_views=[1, 1],
+ transforms=[view_pipeline1, view_pipeline2]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=256, dataset=dict(pipeline=train_pipeline))
+
+# model settings
+temperature = 0.2
+model = dict(
+ type='MoCoV3',
+ base_momentum=0.01,
+ backbone=dict(
+ type='MoCoV3ViT',
+ arch='base', # embed_dim = 768
+ img_size=224,
+ patch_size=16,
+ stop_grad_conv1=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=768,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=3,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ head=dict(
+ type='MoCoV3Head',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
+ temperature=temperature))
+
+# optimizer
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(type='AdamW', lr=2.4e-3, weight_decay=0.1))
+find_unused_parameters = True
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py b/configs/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ae31c6d8c9540640591a668be09f3cc670970283
--- /dev/null
+++ b/configs/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py
@@ -0,0 +1,154 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mocov3.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+# the difference between the ResNet50 and ViT pipelines is the crop ratio in
+# `RandomResizedCrop`: `crop_ratio_range=(0.08, 1.)` in the ViT pipeline
+view_pipeline1 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.08, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=1.),
+ dict(type='Solarize', thr=128, prob=0.),
+ dict(type='RandomFlip', prob=0.5),
+]
+view_pipeline2 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.08, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.1),
+ dict(type='Solarize', thr=128, prob=0.2),
+ dict(type='RandomFlip', prob=0.5),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='MultiView',
+ num_views=[1, 1],
+ transforms=[view_pipeline1, view_pipeline2]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=64, dataset=dict(pipeline=train_pipeline))
+
+# model settings
+temperature = 0.2
+model = dict(
+ type='MoCoV3',
+ base_momentum=0.01,
+ backbone=dict(
+ type='MoCoV3ViT',
+ arch='large', # embed_dim = 1024
+ img_size=224,
+ patch_size=16,
+ stop_grad_conv1=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=1024,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=3,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ head=dict(
+ type='MoCoV3Head',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
+ temperature=temperature))
+
+# optimizer
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ clip_grad=dict(max_norm=5.0, error_if_nonfinite=False),
+ optimizer=dict(type='AdamW', lr=2.4e-3, weight_decay=0.1))
+find_unused_parameters = True
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+randomness = dict(seed=0)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py b/configs/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0d26eec77d847c5f7fdb02b20bea224b43ce393d
--- /dev/null
+++ b/configs/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,151 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mocov3.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+# the difference between the ResNet50 and ViT pipelines is the crop ratio in
+# `RandomResizedCrop`: `crop_ratio_range=(0.08, 1.)` in the ViT pipeline
+view_pipeline1 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.08, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=1.),
+ dict(type='Solarize', thr=128, prob=0.),
+ dict(type='RandomFlip', prob=0.5),
+]
+view_pipeline2 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.08, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.1),
+ dict(type='Solarize', thr=128, prob=0.2),
+ dict(type='RandomFlip', prob=0.5),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='MultiView',
+ num_views=[1, 1],
+ transforms=[view_pipeline1, view_pipeline2]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=256, dataset=dict(pipeline=train_pipeline))
+
+# model settings
+temperature = 0.2
+model = dict(
+ type='MoCoV3',
+ base_momentum=0.01,
+ backbone=dict(
+ type='MoCoV3ViT',
+ arch='mocov3-small', # embed_dim = 384
+ img_size=224,
+ patch_size=16,
+ stop_grad_conv1=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=384,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=3,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ head=dict(
+ type='MoCoV3Head',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
+ temperature=temperature))
+
+# optimizer
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(type='AdamW', lr=2.4e-3, weight_decay=0.1))
+find_unused_parameters = True
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mvit/README.md b/configs/mvit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1bf72e5e4cbb71c8ba548d9a730b0180e47fbc37
--- /dev/null
+++ b/configs/mvit/README.md
@@ -0,0 +1,85 @@
+# MViT V2
+
+> [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf)
+
+
+
+## Abstract
+
+In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video
+classification, as well as object detection. We present an improved version of MViT that incorporates
+decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture
+in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where
+it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where
+it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art
+performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as
+well as 86.1% on Kinetics-400 video classification.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mvitv2-tiny_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mvitv2-tiny_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
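+
+When the model is called with a plain tensor as above, it returns the raw classification logits rather than post-processed predictions. The snippet below is a small sketch of turning them into top-5 predictions; it assumes the head outputs a `(1, 1000)` logits tensor for a single 224x224 input.
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mvitv2-tiny_3rdparty_in1k', pretrained=True)
+model.eval()
+
+with torch.no_grad():
+    logits = model(torch.rand(1, 3, 224, 224))
+
+# Softmax over the class dimension, then keep the five highest-scoring classes.
+probs = torch.softmax(logits, dim=1)
+top5 = probs.topk(5, dim=1)
+print(top5.indices[0].tolist())
+print(top5.values[0].tolist())
+```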
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/mvit/mvitv2-tiny_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-tiny_3rdparty_in1k_20220722-db7beeef.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------: | :----------------------------------------------------------------------------------: |
+| `mvitv2-tiny_3rdparty_in1k`\* | From scratch | 24.17 | 4.70 | 82.33 | 96.15 | [config](mvitv2-tiny_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-tiny_3rdparty_in1k_20220722-db7beeef.pth) |
+| `mvitv2-small_3rdparty_in1k`\* | From scratch | 34.87 | 7.00 | 83.63 | 96.51 | [config](mvitv2-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-small_3rdparty_in1k_20220722-986bd741.pth) |
+| `mvitv2-base_3rdparty_in1k`\* | From scratch | 51.47 | 10.16 | 84.34 | 96.86 | [config](mvitv2-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-base_3rdparty_in1k_20220722-9c4f0a17.pth) |
+| `mvitv2-large_3rdparty_in1k`\* | From scratch | 217.99 | 43.87 | 85.25 | 97.14 | [config](mvitv2-large_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-large_3rdparty_in1k_20220722-2b57b983.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/mvit). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{li2021improved,
+ title={MViTv2: Improved multiscale vision transformers for classification and detection},
+ author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
+ booktitle={CVPR},
+ year={2022}
+}
+```
diff --git a/configs/mvit/metafile.yml b/configs/mvit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..c16f4f8871562637e7251eb2950bd72d3fee7df7
--- /dev/null
+++ b/configs/mvit/metafile.yml
@@ -0,0 +1,95 @@
+Collections:
+ - Name: MViT V2
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - GELU
+ - Layer Normalization
+ - Scaled Dot-Product Attention
+ - Attention Pooling
+ Paper:
+ URL: http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf
+ Title: 'MViTv2: Improved Multiscale Vision Transformers for Classification and Detection'
+ README: configs/mvit/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.24.0/mmcls/models/backbones/mvit.py
+ Version: v0.24.0
+
+Models:
+ - Name: mvitv2-tiny_3rdparty_in1k
+ In Collection: MViT V2
+ Metadata:
+ FLOPs: 4703510768
+ Parameters: 24173320
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 82.33
+ Top 5 Accuracy: 96.15
+ Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-tiny_3rdparty_in1k_20220722-db7beeef.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_T_in1k.pyth
+ Code: https://github.com/facebookresearch/mvit
+ Config: configs/mvit/mvitv2-tiny_8xb256_in1k.py
+
+ - Name: mvitv2-small_3rdparty_in1k
+ In Collection: MViT V2
+ Metadata:
+ FLOPs: 6997555136
+ Parameters: 34870216
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 83.63
+ Top 5 Accuracy: 96.51
+ Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-small_3rdparty_in1k_20220722-986bd741.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_S_in1k.pyth
+ Code: https://github.com/facebookresearch/mvit
+ Config: configs/mvit/mvitv2-small_8xb256_in1k.py
+
+ - Name: mvitv2-base_3rdparty_in1k
+ In Collection: MViT V2
+ Metadata:
+ FLOPs: 10157964400
+ Parameters: 51472744
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.34
+ Top 5 Accuracy: 96.86
+ Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-base_3rdparty_in1k_20220722-9c4f0a17.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_B_in1k.pyth
+ Code: https://github.com/facebookresearch/mvit
+ Config: configs/mvit/mvitv2-base_8xb256_in1k.py
+
+ - Name: mvitv2-large_3rdparty_in1k
+ In Collection: MViT V2
+ Metadata:
+ FLOPs: 43868151412
+ Parameters: 217992952
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.25
+ Top 5 Accuracy: 97.14
+ Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-large_3rdparty_in1k_20220722-2b57b983.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_L_in1k.pyth
+ Code: https://github.com/facebookresearch/mvit
+ Config: configs/mvit/mvitv2-large_8xb256_in1k.py
diff --git a/configs/mvit/mvitv2-base_8xb256_in1k.py b/configs/mvit/mvitv2-base_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ee3ec11e2bc9873e21b58f0e3e940b5d9fc1e4d5
--- /dev/null
+++ b/configs/mvit/mvitv2-base_8xb256_in1k.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/models/mvit/mvitv2-base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-4),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.pos_embed': dict(decay_mult=0.0),
+ '.rel_pos_h': dict(decay_mult=0.0),
+ '.rel_pos_w': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=1.0),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=70,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70)
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/mvit/mvitv2-large_8xb256_in1k.py b/configs/mvit/mvitv2-large_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..eacddf96e9f9ab6b0da3f3edec973d69d41d1c9b
--- /dev/null
+++ b/configs/mvit/mvitv2-large_8xb256_in1k.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/models/mvit/mvitv2-large.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-4),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.pos_embed': dict(decay_mult=0.0),
+ '.rel_pos_h': dict(decay_mult=0.0),
+ '.rel_pos_w': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=1.0),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=70,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70)
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/mvit/mvitv2-small_8xb256_in1k.py b/configs/mvit/mvitv2-small_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..74cfd0a357a7ab773f5ac27404bbc0b78b06f901
--- /dev/null
+++ b/configs/mvit/mvitv2-small_8xb256_in1k.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/models/mvit/mvitv2-small.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-4),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.pos_embed': dict(decay_mult=0.0),
+ '.rel_pos_h': dict(decay_mult=0.0),
+ '.rel_pos_w': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=1.0),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=70,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70)
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/mvit/mvitv2-tiny_8xb256_in1k.py b/configs/mvit/mvitv2-tiny_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4e563a2c9840fe27ae7ba4425976b540b40d21bc
--- /dev/null
+++ b/configs/mvit/mvitv2-tiny_8xb256_in1k.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/models/mvit/mvitv2-tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-4),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.pos_embed': dict(decay_mult=0.0),
+ '.rel_pos_h': dict(decay_mult=0.0),
+ '.rel_pos_w': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=1.0),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=70,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70)
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/ofa/README.md b/configs/ofa/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..22e20f8bd85d41ed7faa1794273aeec002311f17
--- /dev/null
+++ b/configs/ofa/README.md
@@ -0,0 +1,88 @@
+# OFA
+
+> [OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework](https://arxiv.org/abs/2202.03052)
+
+
+
+## Abstract
+
+In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows the instruction-based learning in both pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with the recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performances on uni-modal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+from mmpretrain import inference_model
+
+result = inference_model('ofa-base_3rdparty-finetuned_caption', 'demo/cat-dog.png')
+print(result)
+# {'pred_caption': 'a dog and a kitten sitting next to each other'}
+```
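+
+The VQA checkpoints listed in the tables below can be queried through the same `inference_model` helper. The call below is a hedged sketch: it assumes the VQA inferencer accepts the question as the second positional argument, so adjust it if the interface of your installed version differs.
+
+```python
+from mmpretrain import inference_model
+
+# Ask the zero-shot VQA model a free-form question about the demo image.
+result = inference_model(
+    'ofa-base_3rdparty-zeroshot_vqa',
+    'demo/cat-dog.png',
+    'what animals are in the picture?')
+print(result)
+```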
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/ofa/ofa-base_finetuned_refcoco.py https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_refcoco_20230418-2797d3ab.pth
+```
+
+
+
+## Models and results
+
+### Image Caption on COCO
+
+| Model | Params (M) | BLEU-4 | CIDER | Config | Download |
+| :-------------------------------------- | :--------: | :----: | :----: | :-------------------------------------: | :--------------------------------------------------------------------------------------------------: |
+| `ofa-base_3rdparty-finetuned_caption`\* | 182.24 | 42.64 | 144.50 | [config](ofa-base_finetuned_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_coco-caption_20230418-de18914e.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/OFA-Sys/OFA). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Visual Grounding on RefCOCO
+
+| Model | Params (M) | Accuracy (testA) | Accuracy (testB) | Config | Download |
+| :-------------------------------------- | :--------: | :--------------: | :--------------: | :-------------------------------------: | :------------------------------------------------------------------------------: |
+| `ofa-base_3rdparty-finetuned_refcoco`\* | 182.24 | 90.49 | 83.63 | [config](ofa-base_finetuned_refcoco.py) | [model](https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_refcoco_20230418-2797d3ab.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/OFA-Sys/OFA). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Visual Question Answering on VQAv2
+
+| Model | Params (M) | Accuracy | Config | Download |
+| :---------------------------------- | :--------: | :------: | :---------------------------------: | :--------------------------------------------------------------------------------------------------------------: |
+| `ofa-base_3rdparty-finetuned_vqa`\* | 182.24 | 78.00 | [config](ofa-base_finetuned_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_coco-vqa_20230418-f38539a5.pth) |
+| `ofa-base_3rdparty-zeroshot_vqa`\* | 182.24 | 58.32 | [config](ofa-base_zeroshot_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_pretrain_20230418-dccfc07f.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/OFA-Sys/OFA). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{wang2022ofa,
+ author = {Peng Wang and
+ An Yang and
+ Rui Men and
+ Junyang Lin and
+ Shuai Bai and
+ Zhikang Li and
+ Jianxin Ma and
+ Chang Zhou and
+ Jingren Zhou and
+ Hongxia Yang},
+ title = {OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence
+ Learning Framework},
+ journal = {CoRR},
+ volume = {abs/2202.03052},
+ year = {2022}
+}
+```
diff --git a/configs/ofa/metafile.yml b/configs/ofa/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..9c4b3ebf72b766ae64b89bc22ab60c616159af1d
--- /dev/null
+++ b/configs/ofa/metafile.yml
@@ -0,0 +1,89 @@
+Collections:
+ - Name: OFA
+ Metadata:
+ Architecture:
+ - ResNet
+ - Transformer
+ Training Data:
+ - CC12M
+ - CC3M
+ - SBU
+ - COCO
+ - VG
+ - VQAv2
+ - GQA
+ - RefCOCO
+ - OpenImages
+ - Object365
+ - YFCC100M
+ - ImageNet-21K
+ - Pile
+ Paper:
+ Title: 'OFA: Unifying Architectures, Tasks, and Modalities Through a Simple
+ Sequence-to-Sequence Learning Framework'
+ URL: https://arxiv.org/abs/2202.03052
+ README: configs/ofa/README.md
+
+Models:
+ - Name: ofa-base_3rdparty-finetuned_refcoco
+ Metadata:
+ FLOPs: null
+ Parameters: 182238536
+ In Collection: OFA
+ Results:
+ - Task: Visual Grounding
+ Dataset: RefCOCO
+ Metrics:
+ Accuracy (testA): 90.49
+ Accuracy (testB): 83.63
+ Weights: https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_refcoco_20230418-2797d3ab.pth
+ Config: configs/ofa/ofa-base_finetuned_refcoco.py
+ Converted From:
+ Weights: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcoco_base_best.pt
+ Code: https://github.com/OFA-Sys/OFA
+ - Name: ofa-base_3rdparty-finetuned_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 182238536
+ In Collection: OFA
+ Results:
+ - Task: Visual Question Answering
+ Dataset: VQAv2
+ Metrics:
+ Accuracy: 78.00 # Report from the official repo
+ Weights: https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_coco-vqa_20230418-f38539a5.pth
+ Config: configs/ofa/ofa-base_finetuned_vqa.py
+ Converted From:
+ Weights: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/vqa_large_best.pt
+ Code: https://github.com/OFA-Sys/OFA
+ - Name: ofa-base_3rdparty-finetuned_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 182238536
+ In Collection: OFA
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics:
+ BLEU-4: 42.64
+ CIDER: 144.50
+ Weights: https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_coco-caption_20230418-de18914e.pth
+ Config: configs/ofa/ofa-base_finetuned_caption.py
+ Converted From:
+ Weights: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_base_best.pt
+ Code: https://github.com/OFA-Sys/OFA
+ - Name: ofa-base_3rdparty-zeroshot_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 182238536
+ In Collection: OFA
+ Results:
+ - Task: Visual Question Answering
+ Dataset: VQAv2
+ Metrics:
+ Accuracy: 58.32
+ Weights: https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_pretrain_20230418-dccfc07f.pth
+ Config: configs/ofa/ofa-base_zeroshot_vqa.py
+ Converted From:
+ Weights: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_base.pt
+ Code: https://github.com/OFA-Sys/OFA
diff --git a/configs/ofa/ofa-base_finetuned_caption.py b/configs/ofa/ofa-base_finetuned_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..45efff06ec8ebd5ecc85dbdf15834819fb07bb38
--- /dev/null
+++ b/configs/ofa/ofa-base_finetuned_caption.py
@@ -0,0 +1,41 @@
+_base_ = [
+ '../_base_/datasets/coco_caption.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='OFA',
+ task='caption',
+ vocab_size=59457,
+ embedding_dim=768,
+ encoder_cfg=dict(
+ embed_images=dict(type='OFAResNet', depth=101),
+ num_layers=6,
+ ),
+ decoder_cfg=dict(num_layers=6),
+ generation_cfg=dict(use_cache=True),
+ tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-base'),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=(480, 480)),
+ dict(type='PackInputs', meta_keys=('image_id', )),
+]
+
+train_dataloader = None
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+train_cfg = None
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/ofa/ofa-base_finetuned_refcoco.py b/configs/ofa/ofa-base_finetuned_refcoco.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a7435dbd467ed71b3ee6a4e2c6020083c180729
--- /dev/null
+++ b/configs/ofa/ofa-base_finetuned_refcoco.py
@@ -0,0 +1,45 @@
+_base_ = [
+ '../_base_/datasets/refcoco.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='OFA',
+ task='refcoco',
+ vocab_size=59457,
+ embedding_dim=768,
+ encoder_cfg=dict(
+ embed_images=dict(type='OFAResNet', depth=101),
+ num_layers=6,
+ ),
+ decoder_cfg=dict(num_layers=6),
+ generation_cfg=dict(use_cache=True),
+ tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-base'),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=(512, 512)),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'gt_bboxes'],
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+train_cfg = None
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/ofa/ofa-base_finetuned_vqa.py b/configs/ofa/ofa-base_finetuned_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..b120d091e5b9d1b38a3e0ebd1466f0fed9d0f611
--- /dev/null
+++ b/configs/ofa/ofa-base_finetuned_vqa.py
@@ -0,0 +1,64 @@
+_base_ = [
+ '../_base_/datasets/coco_vqa.py',
+ '../_base_/default_runtime.py',
+]
+
+ANS2LABEL = 'https://ofa-beijing.oss-cn-beijing.aliyuncs.com/datasets/vqa_data/trainval_ans2label.pkl' # noqa: E501
+
+# model settings
+model = dict(
+ type='OFA',
+ task='vqa',
+ vocab_size=59457,
+ embedding_dim=768,
+ ans2label=ANS2LABEL,
+ encoder_cfg=dict(
+ embed_images=dict(type='OFAResNet', depth=101),
+ num_layers=6,
+ num_heads=12,
+ ),
+ decoder_cfg=dict(
+ num_layers=6,
+ num_heads=12,
+ ),
+ generation_cfg=dict(
+ num_beams=5,
+ max_new_tokens=200,
+ length_penalty=0., # VQA doesn't require longer answer.
+ use_cache=True,
+ ),
+ tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-base'),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='OFAAddObjects'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=[
+ 'question', 'gt_answer', 'gt_answer_weight', 'decoder_prompt'
+ ],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = None # Eval only
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+train_cfg = None
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/ofa/ofa-base_zeroshot_vqa.py b/configs/ofa/ofa-base_zeroshot_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..9890cdd2a48484102877e3f3a946b73fefa6dbae
--- /dev/null
+++ b/configs/ofa/ofa-base_zeroshot_vqa.py
@@ -0,0 +1,42 @@
+_base_ = [
+ '../_base_/datasets/coco_vqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='OFA',
+ task='vqa',
+ vocab_size=59457,
+ embedding_dim=768,
+ encoder_cfg=dict(
+ embed_images=dict(type='OFAResNet', depth=101),
+ num_layers=6,
+ num_heads=12,
+ ),
+ decoder_cfg=dict(
+ num_layers=6,
+ num_heads=12,
+ ),
+ generation_cfg=dict(
+ num_beams=20,
+ max_new_tokens=200,
+ length_penalty=0., # VQA doesn't require longer answer.
+ use_cache=True,
+ ),
+ tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-base'),
+)
+
+# data settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ to_rgb=True,
+)
+
+train_dataloader = None # Eval only
+
+# schedule settings
+train_cfg = None
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/ofa/ofa-large_zeroshot_vqa.py b/configs/ofa/ofa-large_zeroshot_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..8b47121127c21baabbb963ccc8407a27d823cec1
--- /dev/null
+++ b/configs/ofa/ofa-large_zeroshot_vqa.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/datasets/coco_vqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='OFA',
+ task='vqa',
+ vocab_size=59457,
+ embedding_dim=1024,
+ encoder_cfg=dict(
+ embed_images=dict(type='OFAResNet', depth=152),
+ num_layers=12,
+ num_heads=16,
+ ),
+ decoder_cfg=dict(
+ num_layers=12,
+ num_heads=16,
+ ),
+ generation_cfg=dict(
+ num_beams=20,
+ max_new_tokens=200,
+ length_penalty=0., # VQA doesn't require longer answer.
+ use_cache=True,
+ ),
+ tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-large'),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ to_rgb=True,
+)
+
+train_dataloader = None # Eval only
+
+# schedule settings
+train_cfg = None
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/otter/README.md b/configs/otter/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..18a84684f84e61a664c0742ff96ecaa440f2633b
--- /dev/null
+++ b/configs/otter/README.md
@@ -0,0 +1,79 @@
+# Otter
+
+> [Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://arxiv.org/abs/2305.03726)
+
+
+
+## Abstract
+
+Large language models (LLMs) have demonstrated significant universal capabilities as few/zero-shot learners in various tasks due to their pre-training on vast amounts of text data, as exemplified by GPT-3, which boosted to InstructGPT and ChatGPT, effectively following natural language instructions to accomplish real-world tasks. In this paper, we propose to introduce instruction tuning into multi-modal models, motivated by the Flamingo model's upstream interleaved format pretraining dataset. We adopt a similar approach to construct our MultI-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. We then introduce Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following ability and in-context learning. We also optimize OpenFlamingo's implementation for researchers, democratizing the required training resources from 1$\times$ A100 GPU to 4$\times$ RTX-3090 GPUs, and integrate both OpenFlamingo and Otter into Huggingface Transformers for more researchers to incorporate the models into their customized training and inference pipelines.
+
+
+
+
+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model, inference_model
+
+model = get_model('otter-9b_3rdparty_caption', pretrained=True, device='cuda', generation_cfg=dict(max_new_tokens=50))
+out = inference_model(model, 'demo/cat-dog.png')
+print(out)
+# {'pred_caption': 'The image features two adorable small puppies sitting next to each other on the grass. One puppy is brown and white, while the other is tan and white. They appear to be relaxing outdoors, enjoying each other'}
+```
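+
+The VQA variant can be used in the same way. The snippet below is a sketch under the assumption that `inference_model` forwards the question to the visual question answering inferencer as the argument after the image.
+
+```python
+from mmpretrain import get_model, inference_model
+
+# Hypothetical VQA usage with the converted Otter checkpoint; the question is
+# assumed to be forwarded to the VQA inferencer after the image path.
+model = get_model('otter-9b_3rdparty_vqa', pretrained=True, device='cuda')
+result = inference_model(model, 'demo/cat-dog.png', 'What animals are shown?')
+print(result)
+```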
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/otter/otter-9b_caption.py https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth
+```
+
+
+
+## Models and results
+
+### Image Caption on COCO
+
+| Model | Params (M) | BLEU-4 | CIDER | Config | Download |
+| :---------------------------- | :--------: | :------: | :------: | :---------------------------: | :------------------------------------------------------------------------------------------------------: |
+| `otter-9b_3rdparty_caption`\* | 8220.45 | Upcoming | Upcoming | [config](otter-9b_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/Luodian/Otter/tree/main). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Visual Question Answering on VQAv2
+
+| Model | Params (M) | Accuracy | Config | Download |
+| :------------------------ | :--------: | :------: | :-----------------------: | :------------------------------------------------------------------------------------------------------: |
+| `otter-9b_3rdparty_vqa`\* | 8220.45 | Upcoming | [config](otter-9b_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/Luodian/Otter/tree/main). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{li2023otter,
+ title={Otter: A Multi-Modal Model with In-Context Instruction Tuning},
+ author={Li, Bo and Zhang, Yuanhan and Chen, Liangyu and Wang, Jinghao and Yang, Jingkang and Liu, Ziwei},
+ journal={arXiv preprint arXiv:2305.03726},
+ year={2023}
+}
+
+@article{li2023mimicit,
+ title={MIMIC-IT: Multi-Modal In-Context Instruction Tuning},
+ author={Bo Li and Yuanhan Zhang and Liangyu Chen and Jinghao Wang and Fanyi Pu and Jingkang Yang and Chunyuan Li and Ziwei Liu},
+ year={2023},
+ eprint={2306.05425},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/otter/metafile.yml b/configs/otter/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..6ee89c62a4d073b5eada03e8f9fbb3508041b8d5
--- /dev/null
+++ b/configs/otter/metafile.yml
@@ -0,0 +1,43 @@
+Collections:
+ - Name: Otter
+ Metadata:
+ Architecture:
+ - Transformer
+ - Gated Cross-Attention Dense
+ Paper:
+ Title: 'Otter: A Multi-Modal Model with In-Context Instruction Tuning'
+ URL: https://arxiv.org/abs/2305.03726
+ README: configs/otter/README.md
+
+Models:
+ - Name: otter-9b_3rdparty_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 8220452880
+ In Collection: Otter
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics:
+ BLEU-4: null
+ CIDER: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth
+ Config: configs/otter/otter-9b_caption.py
+ Converted From:
+ Weights: https://huggingface.co/luodian/otter-9b-hf
+ Code: https://github.com/Luodian/Otter/tree/main
+ - Name: otter-9b_3rdparty_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 8220452880
+ In Collection: Otter
+ Results:
+ - Task: Visual Question Answering
+ Dataset: VQAv2
+ Metrics:
+ Accuracy: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth
+ Config: configs/otter/otter-9b_vqa.py
+ Converted From:
+ Weights: https://huggingface.co/luodian/otter-9b-hf
+ Code: https://github.com/Luodian/Otter/tree/main
diff --git a/configs/otter/otter-9b_caption.py b/configs/otter/otter-9b_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..e35e92ef40cabcccd35f17dd661199b04a76dd6b
--- /dev/null
+++ b/configs/otter/otter-9b_caption.py
@@ -0,0 +1,87 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Otter',
+ tokenizer=dict(type='LlamaTokenizer', name_or_path='huggyllama/llama-7b'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='mmpretrain.QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ lang_encoder=dict(
+ base=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='huggyllama/llama-7b',
+ local_files_only=True),
+ adapter=dict(
+ type='FlamingoLMAdapter',
+ vis_hidden_size=1024,
+ cross_attn_every_n_layers=4,
+ use_media_placement_augmentation=False,
+ only_attend_previous=True,
+ ),
+ ),
+ task='caption',
+ final_prompt_tmpl='User:Please describe the image. GPT:',
+ generation_cfg=dict(
+ num_beams=3, max_new_tokens=24, no_repeat_ngram_size=3),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=(224, 224)),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['gt_caption'],
+ meta_keys=['image_id'],
+ ),
+]
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_val.json',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/coco_karpathy_val_gt.json')
+
+# If you want the standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
+
+# schedule settings
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/otter/otter-9b_vqa.py b/configs/otter/otter-9b_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..72f2b64281126cbf71a81929b12318b0a00f9e36
--- /dev/null
+++ b/configs/otter/otter-9b_vqa.py
@@ -0,0 +1,104 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Otter',
+ tokenizer=dict(type='LlamaTokenizer', name_or_path='huggyllama/llama-7b'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ lang_encoder=dict(
+ base=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='huggyllama/llama-7b',
+ local_files_only=True),
+ adapter=dict(
+ type='FlamingoLMAdapter',
+ vis_hidden_size=1024,
+ cross_attn_every_n_layers=4,
+ use_media_placement_augmentation=False,
+ only_attend_previous=True,
+ ),
+ ),
+ task='vqa',
+ final_prompt_tmpl='User:{question} GPT:',
+ generation_cfg=dict(
+ num_beams=3, max_new_tokens=24, no_repeat_ngram_size=3),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=(224, 224)),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'],
+ meta_keys=['image_id'],
+ ),
+]
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOVQA',
+ data_root='data/coco',
+ data_prefix='val2014',
+ question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json',
+ ann_file='annotations/v2_mscoco_val2014_annotations.json',
+ pipeline=test_pipeline,
+ num_shots=0,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VQAAcc')
+
+test_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOVQA',
+ data_root='data/coco',
+ data_prefix='test2015',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json',
+ pipeline=test_pipeline,
+ num_shots=0,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json')
+
+# schedule settings
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/poolformer/README.md b/configs/poolformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2c4b249329ea03662f768aa350a08fb8eebc763b
--- /dev/null
+++ b/configs/poolformer/README.md
@@ -0,0 +1,80 @@
+# PoolFormer
+
+> [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418)
+
+
+
+## Abstract
+
+Transformers have shown great potential in computer vision tasks. A common belief is their attention-based token mixer module contributes most to their competence. However, recent works show the attention-based module in transformers can be replaced by spatial MLPs and the resulted models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model, termed as PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing well-tuned vision transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 49%/61% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and urges us to initiate the concept of "MetaFormer", a general architecture abstracted from transformers without specifying the token mixer. Based on the extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design.
+
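+Since the token mixer is the central claim of the paper, a minimal PyTorch sketch of it may help. This is only an illustration of the idea (spatial average pooling with the identity subtracted), not the MMPreTrain implementation.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class PoolingTokenMixer(nn.Module):
+    """Sketch of PoolFormer's token mixer: average pooling minus identity."""
+
+    def __init__(self, pool_size: int = 3):
+        super().__init__()
+        self.pool = nn.AvgPool2d(
+            pool_size, stride=1, padding=pool_size // 2, count_include_pad=False)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        # x is an (N, C, H, W) feature map. The block's residual branch adds x
+        # back, so subtracting it here keeps only the neighbourhood information.
+        return self.pool(x) - x
+
+
+print(PoolingTokenMixer()(torch.rand(1, 64, 56, 56)).shape)
+```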
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('poolformer-s12_3rdparty_32xb128_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('poolformer-s12_3rdparty_32xb128_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/poolformer/poolformer-s12_32xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s12_3rdparty_32xb128_in1k_20220414-f8d83051.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :--------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :---------------------------------------------------------------------: |
+| `poolformer-s12_3rdparty_32xb128_in1k`\* | From scratch | 11.92 | 1.87 | 77.24 | 93.51 | [config](poolformer-s12_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s12_3rdparty_32xb128_in1k_20220414-f8d83051.pth) |
+| `poolformer-s24_3rdparty_32xb128_in1k`\* | From scratch | 21.39 | 3.51 | 80.33 | 95.05 | [config](poolformer-s24_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s24_3rdparty_32xb128_in1k_20220414-d7055904.pth) |
+| `poolformer-s36_3rdparty_32xb128_in1k`\* | From scratch | 30.86 | 5.15 | 81.43 | 95.45 | [config](poolformer-s36_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s36_3rdparty_32xb128_in1k_20220414-d78ff3e8.pth) |
+| `poolformer-m36_3rdparty_32xb128_in1k`\* | From scratch | 56.17 | 8.96 | 82.14 | 95.71 | [config](poolformer-m36_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m36_3rdparty_32xb128_in1k_20220414-c55e0949.pth) |
+| `poolformer-m48_3rdparty_32xb128_in1k`\* | From scratch | 73.47 | 11.80 | 82.51 | 95.95 | [config](poolformer-m48_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m48_3rdparty_32xb128_in1k_20220414-9378f3eb.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/sail-sg/poolformer). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{yu2022metaformer,
+ title={Metaformer is actually what you need for vision},
+ author={Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng},
+ booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
+ pages={10819--10829},
+ year={2022}
+}
+```
diff --git a/configs/poolformer/metafile.yml b/configs/poolformer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..55285ddd0230270030f25bef09b1461dc7278dc3
--- /dev/null
+++ b/configs/poolformer/metafile.yml
@@ -0,0 +1,99 @@
+Collections:
+ - Name: PoolFormer
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Pooling
+ - 1x1 Convolution
+ - LayerScale
+ Paper:
+ URL: https://arxiv.org/abs/2111.11418
+ Title: MetaFormer is Actually What You Need for Vision
+ README: configs/poolformer/README.md
+ Code:
+ Version: v0.22.1
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.22.1/mmcls/models/backbones/poolformer.py
+
+Models:
+ - Name: poolformer-s12_3rdparty_32xb128_in1k
+ Metadata:
+ FLOPs: 1871399424
+ Parameters: 11915176
+ In Collection: PoolFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.24
+ Top 5 Accuracy: 93.51
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s12_3rdparty_32xb128_in1k_20220414-f8d83051.pth
+ Config: configs/poolformer/poolformer-s12_32xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s12.pth.tar
+ Code: https://github.com/sail-sg/poolformer
+ - Name: poolformer-s24_3rdparty_32xb128_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 3510411008
+ Parameters: 21388968
+ In Collection: PoolFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.33
+ Top 5 Accuracy: 95.05
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s24_3rdparty_32xb128_in1k_20220414-d7055904.pth
+ Config: configs/poolformer/poolformer-s24_32xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s24.pth.tar
+ Code: https://github.com/sail-sg/poolformer
+ - Name: poolformer-s36_3rdparty_32xb128_in1k
+ Metadata:
+ FLOPs: 5149422592
+ Parameters: 30862760
+ In Collection: PoolFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.43
+ Top 5 Accuracy: 95.45
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s36_3rdparty_32xb128_in1k_20220414-d78ff3e8.pth
+ Config: configs/poolformer/poolformer-s36_32xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s36.pth.tar
+ Code: https://github.com/sail-sg/poolformer
+ - Name: poolformer-m36_3rdparty_32xb128_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 8960175744
+ Parameters: 56172520
+ In Collection: PoolFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.14
+ Top 5 Accuracy: 95.71
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m36_3rdparty_32xb128_in1k_20220414-c55e0949.pth
+ Config: configs/poolformer/poolformer-m36_32xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_m36.pth.tar
+ Code: https://github.com/sail-sg/poolformer
+ - Name: poolformer-m48_3rdparty_32xb128_in1k
+ Metadata:
+ FLOPs: 11801805696
+ Parameters: 73473448
+ In Collection: PoolFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.51
+ Top 5 Accuracy: 95.95
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m48_3rdparty_32xb128_in1k_20220414-9378f3eb.pth
+ Config: configs/poolformer/poolformer-m48_32xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_m48.pth.tar
+ Code: https://github.com/sail-sg/poolformer
diff --git a/configs/poolformer/poolformer-m36_32xb128_in1k.py b/configs/poolformer/poolformer-m36_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..13065b8cf5100b4d16696d54cfa8c0a727541831
--- /dev/null
+++ b/configs/poolformer/poolformer-m36_32xb128_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/poolformer/poolformer_m36.py',
+ '../_base_/datasets/imagenet_bs128_poolformer_medium_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/poolformer/poolformer-m48_32xb128_in1k.py b/configs/poolformer/poolformer-m48_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2078df39c4a16783b8f1a7ffc5c5da2b346eb1f0
--- /dev/null
+++ b/configs/poolformer/poolformer-m48_32xb128_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/poolformer/poolformer_m48.py',
+ '../_base_/datasets/imagenet_bs128_poolformer_medium_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/poolformer/poolformer-s12_32xb128_in1k.py b/configs/poolformer/poolformer-s12_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7cf4a6365604def73f2ea293b857ebdc8b2ed9b3
--- /dev/null
+++ b/configs/poolformer/poolformer-s12_32xb128_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/poolformer/poolformer_s12.py',
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/poolformer/poolformer-s24_32xb128_in1k.py b/configs/poolformer/poolformer-s24_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ffb2482d16c3e432c1f3d0a233a69a76b99efdd8
--- /dev/null
+++ b/configs/poolformer/poolformer-s24_32xb128_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/poolformer/poolformer_s24.py',
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/poolformer/poolformer-s36_32xb128_in1k.py b/configs/poolformer/poolformer-s36_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..842dab3ac51645046d15f04b8bc1ace42781144b
--- /dev/null
+++ b/configs/poolformer/poolformer-s36_32xb128_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/poolformer/poolformer_s36.py',
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/regnet/README.md b/configs/regnet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..63031f4e89b934d823ce53f08cdbad597729fd7e
--- /dev/null
+++ b/configs/regnet/README.md
@@ -0,0 +1,88 @@
+# RegNet
+
+> [Designing Network Design Spaces](https://arxiv.org/abs/2003.13678)
+
+
+
+## Abstract
+
+In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
+
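+The quantized linear rule mentioned in the abstract is simple enough to sketch in a few lines of NumPy. This is an illustration, not the MMPreTrain implementation; the `regnetx_800mf` parameters below are an assumption based on the backbone's arch settings, and the resulting last stage width should line up with the head `in_channels` used in `regnetx-800mf_8xb128_in1k.py`.
+
+```python
+import numpy as np
+
+
+def regnet_widths(w0, wa, wm, depth, group_w, q=8):
+    """Per-block widths from the quantized linear rule (illustration only)."""
+    # Continuous widths: u_j = w0 + wa * j for block index j.
+    u = w0 + wa * np.arange(depth)
+    # Quantize onto the geometric progression w0 * wm ** s.
+    s = np.round(np.log(u / w0) / np.log(wm))
+    widths = w0 * np.power(wm, s)
+    # Round to a multiple of q, then to a multiple of the group width so that
+    # grouped convolutions divide the channels evenly.
+    widths = np.round(widths / q) * q
+    return (np.round(widths / group_w) * group_w).astype(int)
+
+
+# Parameters roughly matching the regnetx_800mf arch setting (an assumption).
+widths = regnet_widths(w0=56, wa=35.73, wm=2.28, depth=16, group_w=16)
+print(sorted(set(widths.tolist())))  # stage widths; the last one (672) matches
+                                     # the head `in_channels` of the 800mf config
+```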
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('regnetx-400mf_8xb128_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('regnetx-400mf_8xb128_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/regnet/regnetx-400mf_8xb128_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/regnet/regnetx-400mf_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-400mf_8xb128_in1k_20211213-89bfc226.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------: | :------------------------------------------------------------------------------------: |
+| `regnetx-400mf_8xb128_in1k` | From scratch | 5.16 | 0.41 | 72.56 | 90.78 | [config](regnetx-400mf_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-400mf_8xb128_in1k_20211213-89bfc226.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-400mf_8xb128_in1k_20211208_143316.json) |
+| `regnetx-800mf_8xb128_in1k` | From scratch | 7.26 | 0.81 | 74.76 | 92.32 | [config](regnetx-800mf_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-800mf_8xb128_in1k_20211213-222b0f11.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-800mf_8xb128_in1k_20211207_143037.log.json) |
+| `regnetx-1.6gf_8xb128_in1k` | From scratch | 9.19 | 1.63 | 76.84 | 93.31 | [config](regnetx-1.6gf_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-1.6gf_8xb128_in1k_20211213-d1b89758.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-1.6gf_8xb128_in1k_20211208_143018.log.json) |
+| `regnetx-3.2gf_8xb64_in1k` | From scratch | 3.21 | 1.53 | 78.09 | 94.08 | [config](regnetx-3.2gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-3.2gf_8xb64_in1k_20211213-1fdd82ae.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-3.2gf_8xb64_in1k_20211208_142720.log.json) |
+| `regnetx-4.0gf_8xb64_in1k` | From scratch | 22.12 | 4.00 | 78.60 | 94.17 | [config](regnetx-4.0gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-4.0gf_8xb64_in1k_20211213-efed675c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-4.0gf_8xb64_in1k_20211207_150431.log.json) |
+| `regnetx-6.4gf_8xb64_in1k` | From scratch | 26.21 | 6.51 | 79.38 | 94.65 | [config](regnetx-6.4gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-6.4gf_8xb64_in1k_20211215-5c6089da.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-6.4gf_8xb64_in1k_20211213_172748.log.json) |
+| `regnetx-8.0gf_8xb64_in1k` | From scratch | 39.57 | 8.03 | 79.12 | 94.51 | [config](regnetx-8.0gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-8.0gf_8xb64_in1k_20211213-9a9fcc76.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-8.0gf_8xb64_in1k_20211208_103250.log.json) |
+| `regnetx-12gf_8xb64_in1k` | From scratch | 46.11 | 12.15 | 79.67 | 95.03 | [config](regnetx-12gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-12gf_8xb64_in1k_20211213-5df8c2f8.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-12gf_8xb64_in1k_20211208_143713.log.json) |
+
+## Citation
+
+```bibtex
+@article{radosavovic2020designing,
+ title={Designing Network Design Spaces},
+ author={Ilija Radosavovic and Raj Prateek Kosaraju and Ross Girshick and Kaiming He and Piotr Dollár},
+ year={2020},
+ eprint={2003.13678},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/regnet/metafile.yml b/configs/regnet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..4796a9f42a19092e956b3511467b84b26e372b99
--- /dev/null
+++ b/configs/regnet/metafile.yml
@@ -0,0 +1,122 @@
+Collections:
+ - Name: RegNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Neural Architecture Search
+ - Design Space Design
+ - Precise BN
+ - SGD with nesterov
+ Paper:
+ URL: https://arxiv.org/abs/2003.13678
+ Title: Designing Network Design Spaces
+ README: configs/regnet/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.18.0/mmcls/models/backbones/regnet.py
+ Version: v0.18.0
+
+Models:
+ - Name: regnetx-400mf_8xb128_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-400mf_8xb128_in1k.py
+ Metadata:
+ FLOPs: 410000000 # 0.41G
+ Parameters: 5160000 # 5.16M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 72.56
+ Top 5 Accuracy: 90.78
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-400mf_8xb128_in1k_20211213-89bfc226.pth
+ - Name: regnetx-800mf_8xb128_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-800mf_8xb128_in1k.py
+ Metadata:
+ FLOPs: 810000000 # 0.81G
+ Parameters: 7260000 # 7.26M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 74.76
+ Top 5 Accuracy: 92.32
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-800mf_8xb128_in1k_20211213-222b0f11.pth
+ - Name: regnetx-1.6gf_8xb128_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-1.6gf_8xb128_in1k.py
+ Metadata:
+ FLOPs: 1630000000 # 1.63G
+ Parameters: 9190000 # 9.19M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 76.84
+ Top 5 Accuracy: 93.31
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-1.6gf_8xb128_in1k_20211213-d1b89758.pth
+ - Name: regnetx-3.2gf_8xb64_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-3.2gf_8xb64_in1k.py
+ Metadata:
+ FLOPs: 1530000000 # 1.53G
+ Parameters: 3210000 # 32.1M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 78.09
+ Top 5 Accuracy: 94.08
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-3.2gf_8xb64_in1k_20211213-1fdd82ae.pth
+ - Name: regnetx-4.0gf_8xb64_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-4.0gf_8xb64_in1k.py
+ Metadata:
+ FLOPs: 4000000000 # 4G
+ Parameters: 22120000 # 22.12M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 78.60
+ Top 5 Accuracy: 94.17
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-4.0gf_8xb64_in1k_20211213-efed675c.pth
+ - Name: regnetx-6.4gf_8xb64_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-6.4gf_8xb64_in1k.py
+ Metadata:
+ FLOPs: 6510000000 # 6.51G
+ Parameters: 26210000 # 26.21M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 79.38
+ Top 5 Accuracy: 94.65
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-6.4gf_8xb64_in1k_20211215-5c6089da.pth
+ - Name: regnetx-8.0gf_8xb64_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-8.0gf_8xb64_in1k.py
+ Metadata:
+ FLOPs: 8030000000 # 8.03G
+ Parameters: 39570000 # 39.57M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 79.12
+ Top 5 Accuracy: 94.51
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-8.0gf_8xb64_in1k_20211213-9a9fcc76.pth
+ - Name: regnetx-12gf_8xb64_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-12gf_8xb64_in1k.py
+ Metadata:
+ FLOPs: 12150000000 # 12.15G
+ Parameters: 46110000 # 46.11M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 79.67
+ Top 5 Accuracy: 95.03
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-12gf_8xb64_in1k_20211213-5df8c2f8.pth
diff --git a/configs/regnet/regnetx-1.6gf_8xb128_in1k.py b/configs/regnet/regnetx-1.6gf_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d3e9e934fede12e5c06673dc12898db35654cf2a
--- /dev/null
+++ b/configs/regnet/regnetx-1.6gf_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_1.6gf'),
+ head=dict(in_channels=912, ))
diff --git a/configs/regnet/regnetx-12gf_8xb64_in1k.py b/configs/regnet/regnetx-12gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7a2c0b5aa15ec760c461bf46d6ff9537c68f0fa4
--- /dev/null
+++ b/configs/regnet/regnetx-12gf_8xb64_in1k.py
@@ -0,0 +1,18 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_12gf'),
+ head=dict(in_channels=2240, ))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+# for batch_size 512, use lr = 0.4
+optim_wrapper = dict(optimizer=dict(lr=0.4))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/regnet/regnetx-3.2gf_8xb64_in1k.py b/configs/regnet/regnetx-3.2gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a78478d6df89eee57960f239069192a7d529682e
--- /dev/null
+++ b/configs/regnet/regnetx-3.2gf_8xb64_in1k.py
@@ -0,0 +1,18 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_3.2gf'),
+ head=dict(in_channels=1008, ))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+# for batch_size 512, use lr = 0.4
+optim_wrapper = dict(optimizer=dict(lr=0.4))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/regnet/regnetx-4.0gf_8xb64_in1k.py b/configs/regnet/regnetx-4.0gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..dfc241fe0c8469ae3b8d522b7da7fb2da49f39de
--- /dev/null
+++ b/configs/regnet/regnetx-4.0gf_8xb64_in1k.py
@@ -0,0 +1,18 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_4.0gf'),
+ head=dict(in_channels=1360, ))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+# for batch_size 512, use lr = 0.4
+optim_wrapper = dict(optimizer=dict(lr=0.4))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/regnet/regnetx-400mf_8xb128_in1k.py b/configs/regnet/regnetx-400mf_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..bad16785c04ad49db3b125fdcb343aa4c559cdd9
--- /dev/null
+++ b/configs/regnet/regnetx-400mf_8xb128_in1k.py
@@ -0,0 +1,58 @@
+_base_ = [
+ '../_base_/models/regnet/regnetx_400mf.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs1024_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+data_preprocessor = dict(
+ # BGR format normalization parameters
+ mean=[103.53, 116.28, 123.675],
+ std=[57.375, 57.12, 58.395],
+ to_rgb=False, # The checkpoints from PyCls requires BGR format inputs.
+)
+
+# Lighting params, in order of BGR, from the pycls repo
+EIGVAL = [0.2175, 0.0188, 0.0045]
+EIGVEC = [
+ [-0.5836, -0.6948, 0.4203],
+ [-0.5808, -0.0045, -0.814],
+ [-0.5675, 0.7192, 0.4009],
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='Lighting',
+ eigval=EIGVAL,
+ eigvec=EIGVEC,
+ alphastd=25.5, # because the value range of images is [0,255]
+ to_rgb=False),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128)
+test_dataloader = dict(batch_size=128)
+
+# schedule settings
+
+# SGD with Nesterov, base lr is 0.8 for batch_size 1024
+optim_wrapper = dict(optimizer=dict(lr=0.8, nesterov=True))
+
+# runtime settings
+
+# The Precise BN hook updates the BN statistics, so it should be executed
+# before CheckpointHook (priority 'VERY_LOW') and EMAHook (priority 'NORMAL').
+# Therefore, set the priority of PreciseBNHook to 'ABOVE_NORMAL' here.
+custom_hooks = [
+ dict(
+ type='PreciseBNHook',
+ num_samples=8192,
+ interval=1,
+ priority='ABOVE_NORMAL')
+]
diff --git a/configs/regnet/regnetx-6.4gf_8xb64_in1k.py b/configs/regnet/regnetx-6.4gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..edb1fb8e482cd51f44c377c493f00c3e6d7185ad
--- /dev/null
+++ b/configs/regnet/regnetx-6.4gf_8xb64_in1k.py
@@ -0,0 +1,18 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_6.4gf'),
+ head=dict(in_channels=1624, ))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+# for batch_size 512, use lr = 0.4
+optim_wrapper = dict(optimizer=dict(lr=0.4))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/regnet/regnetx-8.0gf_8xb64_in1k.py b/configs/regnet/regnetx-8.0gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..04b75bbe25987a6b10a984f264288e6c90b29719
--- /dev/null
+++ b/configs/regnet/regnetx-8.0gf_8xb64_in1k.py
@@ -0,0 +1,18 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_8.0gf'),
+ head=dict(in_channels=1920, ))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+# for batch_size 512, use lr = 0.4
+optim_wrapper = dict(optimizer=dict(lr=0.4))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/regnet/regnetx-800mf_8xb128_in1k.py b/configs/regnet/regnetx-800mf_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9cd71379a108703f5ca3ce7f4f156227085045aa
--- /dev/null
+++ b/configs/regnet/regnetx-800mf_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_800mf'),
+ head=dict(in_channels=672, ))
diff --git a/configs/replknet/README.md b/configs/replknet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..3d312f24aa95837c056892cea315458749558206
--- /dev/null
+++ b/configs/replknet/README.md
@@ -0,0 +1,108 @@
+# RepLKNet
+
+> [Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs](https://arxiv.org/abs/2203.06717)
+
+
+
+## Abstract
+
+We revisit large kernel design in modern convolutional neural networks (CNNs). Inspired by recent advances in vision transformers (ViTs), in this paper, we demonstrate that using a few large convolutional kernels instead of a stack of small kernels could be a more powerful paradigm. We suggested five guidelines, e.g., applying re-parameterized large depth-wise convolutions, to design efficient high-performance large-kernel CNNs. Following the guidelines, we propose RepLKNet, a pure CNN architecture whose kernel size is as large as 31×31, in contrast to commonly used 3×3. RepLKNet greatly closes the performance gap between CNNs and ViTs, e.g., achieving comparable or superior results than Swin Transformer on ImageNet and a few typical downstream tasks, with lower latency. RepLKNet also shows nice scalability to big data and large models, obtaining 87.8% top-1 accuracy on ImageNet and 56.0% mIoU on ADE20K, which is very competitive among the state-of-the-arts with similar model sizes. Our study further reveals that, in contrast to small-kernel CNNs, large kernel CNNs have much larger effective receptive fields and higher shape bias rather than texture bias.
+
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model, get_model
+
+model = get_model('replknet-31B_3rdparty_in1k', pretrained=True)
+model.backbone.switch_to_deploy()
+predict = inference_model(model, 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('replknet-31B_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/replknet/replknet-31B_32xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k_20221118-fd08e268.pth
+```
+
+**Reparameterization**
+
+The checkpoints provided are all `training-time` models. Use the reparameterization tool to switch them to the more efficient `inference-time` architecture, which has not only fewer parameters but also fewer computations.
+
+```bash
+python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH}
+```
+
+`${CFG_PATH}` is the config file path, `${SRC_CKPT_PATH}` is the source checkpoint file, and `${TARGET_CKPT_PATH}` is the path of the target deploy weight file.
+
+To use the reparameterized weights, switch to the corresponding deploy config file.
+
+```bash
+python tools/test.py ${deploy_cfg} ${deploy_checkpoint}
+```
+
+You can also use `backbone.switch_to_deploy()` to switch to the deploy mode in Python code. For example:
+
+```python
+from mmpretrain.models import RepLKNet
+
+backbone = RepLKNet(arch='31B')
+backbone.switch_to_deploy()
+```
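+
+A quick way to convince yourself that the reparameterization is lossless is to compare the outputs before and after merging. This is a small sanity-check sketch; it assumes the classifier's default forward returns the classification logits as a tensor, and it must run in `eval()` mode so that the merged BN statistics match the forward pass.
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('replknet-31B_3rdparty_in1k', pretrained=True).eval()
+x = torch.rand(1, 3, 224, 224)
+
+with torch.no_grad():
+    out_train = model(x)               # training-time (multi-branch) structure
+    model.backbone.switch_to_deploy()
+    out_deploy = model(x)              # reparameterized (merged) structure
+
+# The two outputs should agree up to floating point error.
+print((out_train - out_deploy).abs().max())
+```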
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :--------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :------------------------------------------------------------: |
+| `replknet-31B_3rdparty_in1k`\* | From scratch | 79.86 | 15.64 | 83.48 | 96.57 | [config](replknet-31B_32xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k_20221118-fd08e268.pth) |
+| `replknet-31B_3rdparty_in1k-384px`\* | From scratch | 79.86 | 45.95 | 84.84 | 97.34 | [config](replknet-31B_32xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k-384px_20221118-03a170ce.pth) |
+| `replknet-31B_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 79.86 | 15.64 | 85.20 | 97.56 | [config](replknet-31B_32xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_in21k-pre_3rdparty_in1k_20221118-54ed5c46.pth) |
+| `replknet-31B_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 79.86 | 45.95 | 85.99 | 97.75 | [config](replknet-31B_32xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_in21k-pre_3rdparty_in1k-384px_20221118-76c92b24.pth) |
+| `replknet-31L_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 172.67 | 97.24 | 86.63 | 98.00 | [config](replknet-31L_32xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31L_in21k-pre_3rdparty_in1k-384px_20221118-dc3fc07c.pth) |
+| `replknet-XL_meg73m-pre_3rdparty_in1k-320px`\* | MEG73M | 335.44 | 129.57 | 87.57 | 98.39 | [config](replknet-XL_32xb64_in1k-320px.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-XL_meg73m-pre_3rdparty_in1k-320px_20221118-88259b1d.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{ding2022scaling,
+ title={Scaling up your kernels to 31x31: Revisiting large kernel design in cnns},
+ author={Ding, Xiaohan and Zhang, Xiangyu and Han, Jungong and Ding, Guiguang},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ pages={11963--11975},
+ year={2022}
+}
+```
diff --git a/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k-384px.py b/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..a14fe63efafbff3f249a2e4d5b2c96de931c6c1f
--- /dev/null
+++ b/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../replknet-31B_32xb64_in1k-384px.py'
+
+model = dict(backbone=dict(small_kernel_merged=True))
diff --git a/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k.py b/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4f92c494f8afd0d494e199de20f26af7ce151aa1
--- /dev/null
+++ b/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../replknet-31B_32xb64_in1k.py'
+
+model = dict(backbone=dict(small_kernel_merged=True))
diff --git a/configs/replknet/deploy/replknet-31L-deploy_32xb64_in1k-384px.py b/configs/replknet/deploy/replknet-31L-deploy_32xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..63e590f9786173d879b1f4390c91392f1df45bec
--- /dev/null
+++ b/configs/replknet/deploy/replknet-31L-deploy_32xb64_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../replknet-31L_32xb64_in1k-384px.py'
+
+model = dict(backbone=dict(small_kernel_merged=True))
diff --git a/configs/replknet/deploy/replknet-XL-deploy_32xb64_in1k-320px.py b/configs/replknet/deploy/replknet-XL-deploy_32xb64_in1k-320px.py
new file mode 100644
index 0000000000000000000000000000000000000000..a0a8ed5f8f30aea7e53811ae63767187d5494bc6
--- /dev/null
+++ b/configs/replknet/deploy/replknet-XL-deploy_32xb64_in1k-320px.py
@@ -0,0 +1,3 @@
+_base_ = '../replknet-XL_32xb64_in1k-320px.py'
+
+model = dict(backbone=dict(small_kernel_merged=True))
diff --git a/configs/replknet/metafile.yml b/configs/replknet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..f9f37449778415e0de57394adb457c8bc57c9e2b
--- /dev/null
+++ b/configs/replknet/metafile.yml
@@ -0,0 +1,129 @@
+Collections:
+ - Name: RepLKNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Large-Kernel Convolution
+ - VGG-style Neural Network
+ Paper:
+ URL: https://arxiv.org/abs/2203.06717
+ Title: 'Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs'
+ README: configs/replknet/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc3/mmcls/models/backbones/replknet.py
+ Version: v1.0.0rc3
+
+Models:
+ - Name: replknet-31B_3rdparty_in1k
+ In Collection: RepLKNet
+ Config: configs/replknet/replknet-31B_32xb64_in1k.py
+ Metadata:
+ FLOPs: 15636547584
+ Parameters: 79864168
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 83.48
+ Top 5 Accuracy: 96.57
+ Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k_20221118-fd08e268.pth
+ Converted From:
+ Weights: https://drive.google.com/u/0/uc?id=1azQUiCxK9feYVkkrPqwVPBtNsTzDrX7S&export=download
+ Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py
+
+ - Name: replknet-31B_3rdparty_in1k-384px
+ In Collection: RepLKNet
+ Config: configs/replknet/replknet-31B_32xb64_in1k-384px.py
+ Metadata:
+ FLOPs: 45952303104
+ Parameters: 79864168
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.84
+ Top 5 Accuracy: 97.34
+ Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k-384px_20221118-03a170ce.pth
+ Converted From:
+ Weights: https://drive.google.com/u/0/uc?id=1vo-P3XB6mRLUeDzmgv90dOu73uCeLfZN&export=download
+ Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py
+
+ - Name: replknet-31B_in21k-pre_3rdparty_in1k
+ In Collection: RepLKNet
+ Config: configs/replknet/replknet-31B_32xb64_in1k.py
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 15636547584
+ Parameters: 79864168
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.20
+ Top 5 Accuracy: 97.56
+ Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_in21k-pre_3rdparty_in1k_20221118-54ed5c46.pth
+ Converted From:
+ Weights: https://drive.google.com/u/0/uc?id=1DslZ2voXZQR1QoFY9KnbsHAeF84hzS0s&export=download
+ Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py
+
+ - Name: replknet-31B_in21k-pre_3rdparty_in1k-384px
+ In Collection: RepLKNet
+ Config: configs/replknet/replknet-31B_32xb64_in1k-384px.py
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 45952303104
+ Parameters: 79864168
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.99
+ Top 5 Accuracy: 97.75
+ Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_in21k-pre_3rdparty_in1k-384px_20221118-76c92b24.pth
+ Converted From:
+ Weights: https://drive.google.com/u/0/uc?id=1Sc46BWdXXm2fVP-K_hKKU_W8vAB-0duX&export=download
+ Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py
+
+ - Name: replknet-31L_in21k-pre_3rdparty_in1k-384px
+ In Collection: RepLKNet
+ Config: configs/replknet/replknet-31L_32xb64_in1k-384px.py
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 97240006656
+ Parameters: 172671016
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 86.63
+ Top 5 Accuracy: 98.00
+ Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31L_in21k-pre_3rdparty_in1k-384px_20221118-dc3fc07c.pth
+ Converted From:
+ Weights: https://drive.google.com/u/0/uc?id=1JYXoNHuRvC33QV1pmpzMTKEni1hpWfBl&export=download
+ Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py
+
+ - Name: replknet-XL_meg73m-pre_3rdparty_in1k-320px
+ In Collection: RepLKNet
+ Config: configs/replknet/replknet-XL_32xb64_in1k-320px.py
+ Metadata:
+ Training Data:
+ - MegData-73M
+ - ImageNet-1k
+ FLOPs: 129570201600
+ Parameters: 335435752
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 87.57
+ Top 5 Accuracy: 98.39
+ Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-XL_meg73m-pre_3rdparty_in1k-320px_20221118-88259b1d.pth
+ Converted From:
+ Weights: https://drive.google.com/u/0/uc?id=1tPC60El34GntXByIRHb-z-Apm4Y5LX1T&export=download
+ Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py
diff --git a/configs/replknet/replknet-31B_32xb64_in1k-384px.py b/configs/replknet/replknet-31B_32xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..4e714f347a40101f2baf41a0723181a8502af85a
--- /dev/null
+++ b/configs/replknet/replknet-31B_32xb64_in1k-384px.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/replknet-31B_in1k.py',
+ '../_base_/datasets/imagenet_bs16_pil_bicubic_384.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+param_scheduler = dict(
+ type='CosineAnnealingLR', T_max=300, by_epoch=True, begin=0, end=300)
+
+train_cfg = dict(by_epoch=True, max_epochs=300)
diff --git a/configs/replknet/replknet-31B_32xb64_in1k.py b/configs/replknet/replknet-31B_32xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cf06f2d86a39450574747d670f4bb9a7dfffaca6
--- /dev/null
+++ b/configs/replknet/replknet-31B_32xb64_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/replknet-31B_in1k.py',
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+param_scheduler = dict(
+ type='CosineAnnealingLR', T_max=300, by_epoch=True, begin=0, end=300)
+
+train_cfg = dict(by_epoch=True, max_epochs=300)
diff --git a/configs/replknet/replknet-31L_32xb64_in1k-384px.py b/configs/replknet/replknet-31L_32xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..8cdab249fefba7b7878211479b682768538c4b27
--- /dev/null
+++ b/configs/replknet/replknet-31L_32xb64_in1k-384px.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/replknet-31L_in1k.py',
+ '../_base_/datasets/imagenet_bs16_pil_bicubic_384.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+param_scheduler = dict(
+ type='CosineAnnealingLR', T_max=300, by_epoch=True, begin=0, end=300)
+
+train_cfg = dict(by_epoch=True, max_epochs=300)
diff --git a/configs/replknet/replknet-XL_32xb64_in1k-320px.py b/configs/replknet/replknet-XL_32xb64_in1k-320px.py
new file mode 100644
index 0000000000000000000000000000000000000000..9b0aab114e725e822dbffb99a637cc9e770a91e7
--- /dev/null
+++ b/configs/replknet/replknet-XL_32xb64_in1k-320px.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/replknet-XL_in1k.py',
+ '../_base_/datasets/imagenet_bs8_pil_bicubic_320.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+param_scheduler = dict(
+ type='CosineAnnealingLR', T_max=300, by_epoch=True, begin=0, end=300)
+
+train_cfg = dict(by_epoch=True, max_epochs=300)
diff --git a/configs/repmlp/README.md b/configs/repmlp/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..41dfa234bd09153695a09af39b3901e536ca19b6
--- /dev/null
+++ b/configs/repmlp/README.md
@@ -0,0 +1,103 @@
+# RepMLP
+
+> [RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition](https://arxiv.org/abs/2105.01883)
+
+
+
+## Abstract
+
+We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers. Compared to convolutional layers, FC layers are more efficient, better at modeling the long-range dependencies and positional patterns, but worse at capturing the local structures, hence usually less favored for image recognition. We propose a structural re-parameterization technique that adds local prior into an FC to make it powerful for image recognition. Specifically, we construct convolutional layers inside a RepMLP during training and merge them into the FC for inference. On CIFAR, a simple pure-MLP model shows performance very close to CNN. By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs. Our intriguing findings highlight that combining the global representational capacity and positional perception of FC with the local prior of convolution can improve the performance of neural network with faster speed on both the tasks with translation invariance (e.g., semantic segmentation) and those with aligned images and positional patterns (e.g., face recognition).
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model, get_model
+
+model = get_model('repmlp-base_3rdparty_8xb64_in1k', pretrained=True)
+model.backbone.switch_to_deploy()
+predict = inference_model(model, 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('repmlp-base_3rdparty_8xb64_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/repmlp/repmlp-base_8xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k_20220330-1cb1f11b.pth
+```
+
+**Reparameterization**
+
+The checkpoints provided are all `training-time` models. Use the reparameterization tool to convert them to the more efficient `inference-time` architecture, which has both fewer parameters and lower computational cost.
+
+```bash
+python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH}
+```
+
+`${CFG_PATH}` is the path of the config file, `${SRC_CKPT_PATH}` is the path of the source checkpoint file, and `${TARGET_CKPT_PATH}` is the target path of the deploy weight file.
+
+To test the reparameterized weights, you must use the corresponding deploy config file.
+
+```bash
+python tools/test.py ${deploy_cfg} ${deploy_checkpoint} --metrics accuracy
+```
+
+You can also use `backbone.switch_to_deploy()` to switch to the deploy mode in Python code. For example:
+
+```python
+from mmpretrain.models import RepMLPNet
+
+backbone = RepMLPNet(arch='B', img_size=224, reparam_conv_kernels=(1, 3))
+backbone.switch_to_deploy()
+```
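+
+As a rough sanity check (a sketch, not part of the repo), you can verify that the merged model is numerically equivalent to the training-time model:
+
+```python
+import torch
+
+from mmpretrain.models import RepMLPNet
+
+backbone = RepMLPNet(arch='B', img_size=224, reparam_conv_kernels=(1, 3))
+backbone.eval()  # use running BN statistics so the comparison is deterministic
+inputs = torch.rand(1, 3, 224, 224)
+with torch.no_grad():
+    feats_before = backbone(inputs)
+    backbone.switch_to_deploy()
+    feats_after = backbone(inputs)
+# The re-parameterized model should produce (almost) identical outputs.
+print(torch.allclose(feats_before[0], feats_after[0], atol=1e-5))
+```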
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :-------------------------------------------------------------------: |
+| `repmlp-base_3rdparty_8xb64_in1k`\* | From scratch | 68.24 | 6.71 | 80.41 | 95.14 | [config](repmlp-base_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k_20220330-1cb1f11b.pth) |
+| `repmlp-base_3rdparty_8xb64_in1k-256px`\* | From scratch | 96.45 | 9.69 | 81.11 | 95.50 | [config](repmlp-base_8xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k-256px_20220330-7c5a91ce.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/DingXiaoH/RepMLP/blob/072d8516beba83d75dfe6ebb12f625abad4b53d5/repmlpnet.py#L278). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{ding2021repmlp,
+ title={Repmlp: Re-parameterizing convolutions into fully-connected layers for image recognition},
+ author={Ding, Xiaohan and Xia, Chunlong and Zhang, Xiangyu and Chu, Xiaojie and Han, Jungong and Ding, Guiguang},
+ journal={arXiv preprint arXiv:2105.01883},
+ year={2021}
+}
+```
diff --git a/configs/repmlp/metafile.yml b/configs/repmlp/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..7f391e04b7cfc2b3ffc93dbd2a781e6b201d1cde
--- /dev/null
+++ b/configs/repmlp/metafile.yml
@@ -0,0 +1,48 @@
+Collections:
+ - Name: RepMLP
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Multi-layer Perceptron
+ - Re-parameterization Convolution
+ Paper:
+ URL: https://arxiv.org/abs/2105.01883
+ Title: 'RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition'
+ README: configs/repmlp/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.21.0/mmcls/models/backbones/repmlp.py
+ Version: v0.21.0
+
+Models:
+ - Name: repmlp-base_3rdparty_8xb64_in1k
+ In Collection: RepMLP
+ Config: configs/repmlp/repmlp-base_8xb64_in1k.py
+ Metadata:
+ FLOPs: 6710000000 # 6.71 G
+ Parameters: 68240000 # 68.24 M
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.41
+ Top 5 Accuracy: 95.14
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k_20220330-1cb1f11b.pth
+ Converted From:
+ Weights: https://github.com/DingXiaoH/RepMLP
+ Code: https://github.com/DingXiaoH/RepMLP/blob/072d8516beba83d75dfe6ebb12f625abad4b53d5/repmlpnet.py#L274
+ - Name: repmlp-base_3rdparty_8xb64_in1k-256px
+ In Collection: RepMLP
+ Config: configs/repmlp/repmlp-base_8xb64_in1k-256px.py
+ Metadata:
+ FLOPs: 9690000000 # 9.69 G
+ Parameters: 96450000 # 96.45M
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.11
+ Top 5 Accuracy: 95.50
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k-256px_20220330-7c5a91ce.pth
+ Converted From:
+ Weights: https://github.com/DingXiaoH/RepMLP
+ Code: https://github.com/DingXiaoH/RepMLP/blob/072d8516beba83d75dfe6ebb12f625abad4b53d5/repmlpnet.py#L278
diff --git a/configs/repmlp/repmlp-base_8xb64_in1k-256px.py b/configs/repmlp/repmlp-base_8xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..81dc55a204918dec83b31c80cd37125a4ce3bb27
--- /dev/null
+++ b/configs/repmlp/repmlp-base_8xb64_in1k-256px.py
@@ -0,0 +1,36 @@
+_base_ = [
+ '../_base_/models/repmlp-base_224.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(backbone=dict(img_size=256))
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=256),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=292, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/repmlp/repmlp-base_8xb64_in1k.py b/configs/repmlp/repmlp-base_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..666ce405440d7c764a0959900cc3650f329cc019
--- /dev/null
+++ b/configs/repmlp/repmlp-base_8xb64_in1k.py
@@ -0,0 +1,26 @@
+_base_ = [
+ '../_base_/models/repmlp-base_224.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+    # Resize to (256, 256) here, which differs from resizing the shorter edge to 256.
+ dict(type='Resize', scale=(256, 256), backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/repmlp/repmlp-base_delopy_8xb64_in1k.py b/configs/repmlp/repmlp-base_delopy_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b5b2c882341421f225b0b3ca0b57e2efd6c06e07
--- /dev/null
+++ b/configs/repmlp/repmlp-base_delopy_8xb64_in1k.py
@@ -0,0 +1,3 @@
+_base_ = ['./repmlp-base_8xb64_in1k.py']
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/repmlp/repmlp-base_deploy_8xb64_in1k-256px.py b/configs/repmlp/repmlp-base_deploy_8xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..27ff50a02dc65c56162e7f851506f00dbb6bc8da
--- /dev/null
+++ b/configs/repmlp/repmlp-base_deploy_8xb64_in1k-256px.py
@@ -0,0 +1,3 @@
+_base_ = ['./repmlp-base_8xb64_in1k-256px.py']
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/repvgg/README.md b/configs/repvgg/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..9a47f9d1e0a56a027072b661aef54225f1423205
--- /dev/null
+++ b/configs/repvgg/README.md
@@ -0,0 +1,142 @@
+# RepVGG
+
+> [RepVGG: Making VGG-style ConvNets Great Again](https://arxiv.org/abs/2101.03697)
+
+
+
+## Introduction
+
+RepVGG is a VGG-style convolutional architecture. It has the following advantages:
+
+1. The model has a VGG-like plain (a.k.a. feed-forward) topology without any branches, i.e., every layer takes the output of its only preceding layer as input and feeds its output into its only following layer.
+2. The model’s body uses only 3 × 3 convolutions and ReLU.
+3. The concrete architecture (including the specific depths and layer widths) is instantiated with no automatic search, manual refinement, compound scaling, or other heavy designs.
+
+
+

+
+
+## Abstract
+
+
+
+
+
+We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology. Such decoupling of the training-time and inference-time architecture is realized by a structural re-parameterization technique so that the model is named RepVGG. On ImageNet, RepVGG reaches over 80% top-1 accuracy, which is the first time for a plain model, to the best of our knowledge. On NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster than ResNet-101 with higher accuracy and show favorable accuracy-speed trade-off compared to the state-of-the-art models like EfficientNet and RegNet.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model, get_model
+
+model = get_model('repvgg-A0_8xb32_in1k', pretrained=True)
+model.backbone.switch_to_deploy()
+predict = inference_model(model, 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('repvgg-A0_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/repvgg/repvgg-A0_8xb32_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/repvgg/repvgg-A0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A0_8xb32_in1k_20221213-60ae8e23.pth
+```
+
+Test with reparameterized model:
+
+```shell
+python tools/test.py configs/repvgg/repvgg-A0_8xb32_in1k.py repvgg_A0_deploy.pth --cfg-options model.backbone.deploy=True
+```
+
+**Reparameterization**
+
+The checkpoints provided are all `training-time` models. Use the reparameterization tool to convert them to the more efficient `inference-time` architecture, which has both fewer parameters and lower computational cost.
+
+```bash
+python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH}
+```
+
+`${CFG_PATH}` is the path of the config file, `${SRC_CKPT_PATH}` is the path of the source checkpoint file, and `${TARGET_CKPT_PATH}` is the target path of the deploy weight file.
+
+To test the reparameterized weights, you must use the corresponding deploy config file.
+
+```bash
+python tools/test.py ${deploy_cfg} ${deploy_checkpoint} --metrics accuracy
+```
+
+You can also use `backbone.switch_to_deploy()` to switch to the deploy mode in Python code. For example:
+
+```python
+from mmpretrain.models import RepVGG
+
+backbone = RepVGG(arch='A0')
+backbone.switch_to_deploy()
+```
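+
+As an illustration of the effect (a sketch, not part of the repo), you can compare the parameter counts before and after merging the branches:
+
+```python
+from mmpretrain.models import RepVGG
+
+backbone = RepVGG(arch='A0')
+params_train = sum(p.numel() for p in backbone.parameters())
+backbone.switch_to_deploy()
+params_deploy = sum(p.numel() for p in backbone.parameters())
+# The deploy-mode model is smaller because the 3x3, 1x1 and identity branches
+# of each block are fused into a single 3x3 convolution.
+print(f'training-time: {params_train}, inference-time: {params_deploy}')
+```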
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------------------------------------: |
+| `repvgg-A0_8xb32_in1k` | From scratch | 8.31 | 1.36 | 72.37 | 90.56 | [config](repvgg-A0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A0_8xb32_in1k_20221213-60ae8e23.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A0_8xb32_in1k_20221213-60ae8e23.log) |
+| `repvgg-A1_8xb32_in1k` | From scratch | 12.79 | 2.36 | 74.23 | 91.80 | [config](repvgg-A1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A1_8xb32_in1k_20221213-f81bf3df.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A1_8xb32_in1k_20221213-f81bf3df.log) |
+| `repvgg-A2_8xb32_in1k` | From scratch | 25.50 | 5.12 | 76.49 | 93.09 | [config](repvgg-A2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A2_8xb32_in1k_20221213-a8767caf.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A2_8xb32_in1k_20221213-a8767caf.log) |
+| `repvgg-B0_8xb32_in1k` | From scratch | 15.82 | 3.42 | 75.27 | 92.21 | [config](repvgg-B0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B0_8xb32_in1k_20221213-5091ecc7.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B0_8xb32_in1k_20221213-5091ecc7.log) |
+| `repvgg-B1_8xb32_in1k` | From scratch | 51.83 | 11.81 | 78.19 | 94.04 | [config](repvgg-B1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1_8xb32_in1k_20221213-d17c45e7.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1_8xb32_in1k_20221213-d17c45e7.log) |
+| `repvgg-B1g2_8xb32_in1k` | From scratch | 41.36 | 8.81 | 77.87 | 93.99 | [config](repvgg-B1g2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g2_8xb32_in1k_20221213-ae6428fd.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g2_8xb32_in1k_20221213-ae6428fd.log) |
+| `repvgg-B1g4_8xb32_in1k` | From scratch | 36.13 | 7.30 | 77.81 | 93.77 | [config](repvgg-B1g4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g4_8xb32_in1k_20221213-a7a4aaea.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g4_8xb32_in1k_20221213-a7a4aaea.log) |
+| `repvgg-B2_8xb32_in1k` | From scratch | 80.32 | 18.37 | 78.58 | 94.23 | [config](repvgg-B2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2_8xb32_in1k_20221213-d8b420ef.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2_8xb32_in1k_20221213-d8b420ef.log) |
+| `repvgg-B2g4_8xb32_in1k` | From scratch | 55.78 | 11.33 | 79.44 | 94.72 | [config](repvgg-B2g4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2g4_8xb32_in1k_20221213-0c1990eb.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2g4_8xb32_in1k_20221213-0c1990eb.log) |
+| `repvgg-B3_8xb32_in1k` | From scratch | 110.96 | 26.21 | 80.58 | 95.33 | [config](repvgg-B3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3_8xb32_in1k_20221213-927a329a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3_8xb32_in1k_20221213-927a329a.log) |
+| `repvgg-B3g4_8xb32_in1k` | From scratch | 75.63 | 16.06 | 80.26 | 95.15 | [config](repvgg-B3g4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3g4_8xb32_in1k_20221213-e01cb280.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3g4_8xb32_in1k_20221213-e01cb280.log) |
+| `repvgg-D2se_3rdparty_in1k`\* | From scratch | 120.39 | 32.84 | 81.81 | 95.94 | [config](repvgg-D2se_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-D2se_3rdparty_4xb64-autoaug-lbs-mixup-coslr-200e_in1k_20210909-cf3139b7.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/DingXiaoH/RepVGG/blob/9f272318abfc47a2b702cd0e916fca8d25d683e7/repvgg.py#L250). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{ding2021repvgg,
+ title={Repvgg: Making vgg-style convnets great again},
+ author={Ding, Xiaohan and Zhang, Xiangyu and Ma, Ningning and Han, Jungong and Ding, Guiguang and Sun, Jian},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ pages={13733--13742},
+ year={2021}
+}
+```
diff --git a/configs/repvgg/metafile.yml b/configs/repvgg/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..e93250ae2288b2ace58081bdcc24fc80c2f3c5b5
--- /dev/null
+++ b/configs/repvgg/metafile.yml
@@ -0,0 +1,175 @@
+Collections:
+ - Name: RepVGG
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - re-parameterization Convolution
+ - VGG-style Neural Network
+ Paper:
+ URL: https://arxiv.org/abs/2101.03697
+ Title: 'RepVGG: Making VGG-style ConvNets Great Again'
+ README: configs/repvgg/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.16.0/mmcls/models/backbones/repvgg.py#L257
+ Version: v0.16.0
+
+Models:
+ - Name: repvgg-A0_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-A0_8xb32_in1k.py
+ Metadata:
+ FLOPs: 1360233728
+ Parameters: 8309384
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 72.37
+ Top 5 Accuracy: 90.56
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A0_8xb32_in1k_20221213-60ae8e23.pth
+ - Name: repvgg-A1_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-A1_8xb32_in1k.py
+ Metadata:
+ FLOPs: 2362750208
+ Parameters: 12789864
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 74.23
+ Top 5 Accuracy: 91.80
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A1_8xb32_in1k_20221213-f81bf3df.pth
+ - Name: repvgg-A2_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-A2_8xb32_in1k.py
+ Metadata:
+ FLOPs: 5115612544
+ Parameters: 25499944
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 76.49
+ Top 5 Accuracy: 93.09
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A2_8xb32_in1k_20221213-a8767caf.pth
+ - Name: repvgg-B0_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B0_8xb32_in1k.py
+ Metadata:
+      FLOPs: 3420000000
+      Parameters: 15820000
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 75.27
+ Top 5 Accuracy: 92.21
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B0_8xb32_in1k_20221213-5091ecc7.pth
+ - Name: repvgg-B1_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B1_8xb32_in1k.py
+ Metadata:
+ FLOPs: 11813537792
+ Parameters: 51829480
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 78.19
+ Top 5 Accuracy: 94.04
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1_8xb32_in1k_20221213-d17c45e7.pth
+ - Name: repvgg-B1g2_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B1g2_8xb32_in1k.py
+ Metadata:
+ FLOPs: 8807794688
+ Parameters: 41360104
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 77.87
+ Top 5 Accuracy: 93.99
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g2_8xb32_in1k_20221213-ae6428fd.pth
+ - Name: repvgg-B1g4_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B1g4_8xb32_in1k.py
+ Metadata:
+ FLOPs: 7304923136
+ Parameters: 36125416
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 77.81
+ Top 5 Accuracy: 93.77
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g4_8xb32_in1k_20221213-a7a4aaea.pth
+ - Name: repvgg-B2_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B2_8xb32_in1k.py
+ Metadata:
+ FLOPs: 18374175232
+ Parameters: 80315112
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 78.58
+ Top 5 Accuracy: 94.23
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2_8xb32_in1k_20221213-d8b420ef.pth
+ - Name: repvgg-B2g4_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B2g4_8xb32_in1k.py
+ Metadata:
+ FLOPs: 11329464832
+ Parameters: 55777512
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 79.44
+ Top 5 Accuracy: 94.72
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2g4_8xb32_in1k_20221213-0c1990eb.pth
+ - Name: repvgg-B3_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B3_8xb32_in1k.py
+ Metadata:
+ FLOPs: 26206448128
+ Parameters: 110960872
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 80.58
+ Top 5 Accuracy: 95.33
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3_8xb32_in1k_20221213-927a329a.pth
+ - Name: repvgg-B3g4_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B3g4_8xb32_in1k.py
+ Metadata:
+ FLOPs: 16062065152
+ Parameters: 75626728
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 80.26
+ Top 5 Accuracy: 95.15
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3g4_8xb32_in1k_20221213-e01cb280.pth
+ - Name: repvgg-D2se_3rdparty_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-D2se_8xb32_in1k.py
+ Metadata:
+ FLOPs: 32838581760
+ Parameters: 120387572
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 81.81
+ Top 5 Accuracy: 95.94
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-D2se_3rdparty_4xb64-autoaug-lbs-mixup-coslr-200e_in1k_20210909-cf3139b7.pth
+ Converted From:
+ Weights: https://drive.google.com/drive/folders/1Avome4KvNp0Lqh2QwhXO6L5URQjzCjUq
+ Code: https://github.com/DingXiaoH/RepVGG/blob/9f272318abfc47a2b702cd0e916fca8d25d683e7/repvgg.py#L250
diff --git a/configs/repvgg/repvgg-A0_8xb32_in1k.py b/configs/repvgg/repvgg-A0_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b767ae2a3e4062563cec782385baafdf6181baf3
--- /dev/null
+++ b/configs/repvgg/repvgg-A0_8xb32_in1k.py
@@ -0,0 +1,33 @@
+_base_ = [
+ '../_base_/models/repvgg-A0_in1k.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0,
+ custom_keys={
+ 'branch_3x3.norm': dict(decay_mult=0.0),
+ 'branch_1x1.norm': dict(decay_mult=0.0),
+ 'branch_norm.bias': dict(decay_mult=0.0),
+ }))
+
+# schedule settings
+param_scheduler = dict(
+ type='CosineAnnealingLR',
+ T_max=120,
+ by_epoch=True,
+ begin=0,
+ end=120,
+ convert_to_iter_based=True)
+
+train_cfg = dict(by_epoch=True, max_epochs=120)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
diff --git a/configs/repvgg/repvgg-A0_deploy_in1k.py b/configs/repvgg/repvgg-A0_deploy_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..897e2bb36e9ad8197b4889f22530a32a79fef055
--- /dev/null
+++ b/configs/repvgg/repvgg-A0_deploy_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/repvgg/repvgg-A1_8xb32_in1k.py b/configs/repvgg/repvgg-A1_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fab5e586359370dd59a7ba55b91511541e922a11
--- /dev/null
+++ b/configs/repvgg/repvgg-A1_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='A1'))
diff --git a/configs/repvgg/repvgg-A2_8xb32_in1k.py b/configs/repvgg/repvgg-A2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f6196f02fbfedb36e9e498160884eeb7315513f6
--- /dev/null
+++ b/configs/repvgg/repvgg-A2_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='A2'), head=dict(in_channels=1408))
diff --git a/configs/repvgg/repvgg-B0_8xb32_in1k.py b/configs/repvgg/repvgg-B0_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9bbc4ab2259ccd929eae948cae0f676b7fca4b74
--- /dev/null
+++ b/configs/repvgg/repvgg-B0_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B0'), head=dict(in_channels=1280))
diff --git a/configs/repvgg/repvgg-B1_8xb32_in1k.py b/configs/repvgg/repvgg-B1_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e08db3c4b8145cd3141851a7b41bbbe4fbfff776
--- /dev/null
+++ b/configs/repvgg/repvgg-B1_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B1'), head=dict(in_channels=2048))
diff --git a/configs/repvgg/repvgg-B1g2_8xb32_in1k.py b/configs/repvgg/repvgg-B1g2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a1c53fded4e0ff0c59038fb82ca8cb0ca3e41742
--- /dev/null
+++ b/configs/repvgg/repvgg-B1g2_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B1g2'), head=dict(in_channels=2048))
diff --git a/configs/repvgg/repvgg-B1g4_8xb32_in1k.py b/configs/repvgg/repvgg-B1g4_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0757b1e580e5091b9d5c633cd87c856a526ebdf0
--- /dev/null
+++ b/configs/repvgg/repvgg-B1g4_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B1g4'), head=dict(in_channels=2048))
diff --git a/configs/repvgg/repvgg-B2_8xb32_in1k.py b/configs/repvgg/repvgg-B2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b9a7d4ca5570518f0c4d0b81951e0e97c46606f9
--- /dev/null
+++ b/configs/repvgg/repvgg-B2_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B2'), head=dict(in_channels=2560))
diff --git a/configs/repvgg/repvgg-B2g4_8xb32_in1k.py b/configs/repvgg/repvgg-B2g4_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8b3397881d74785870c266f1212cfee364dab38d
--- /dev/null
+++ b/configs/repvgg/repvgg-B2g4_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-B3_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B2g4'), head=dict(in_channels=2560))
diff --git a/configs/repvgg/repvgg-B3_8xb32_in1k.py b/configs/repvgg/repvgg-B3_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e9d5257838c9e2061dfbe39aa2b1456820009ff3
--- /dev/null
+++ b/configs/repvgg/repvgg-B3_8xb32_in1k.py
@@ -0,0 +1,67 @@
+_base_ = [
+ '../_base_/models/repvgg-B3_lbs-mixup_in1k.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0,
+ custom_keys={
+ 'branch_3x3.norm': dict(decay_mult=0.0),
+ 'branch_1x1.norm': dict(decay_mult=0.0),
+ 'branch_norm.bias': dict(decay_mult=0.0),
+ }))
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[round(x) for x in bgr_mean])),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+param_scheduler = dict(
+ type='CosineAnnealingLR',
+ T_max=200,
+ by_epoch=True,
+ begin=0,
+ end=200,
+ convert_to_iter_based=True)
+
+train_cfg = dict(by_epoch=True, max_epochs=200)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
diff --git a/configs/repvgg/repvgg-B3g4_8xb32_in1k.py b/configs/repvgg/repvgg-B3g4_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b0c5c00af845f5e4f02b44105095f78835f35096
--- /dev/null
+++ b/configs/repvgg/repvgg-B3g4_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-B3_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B3g4'))
diff --git a/configs/repvgg/repvgg-D2se_8xb32_in1k.py b/configs/repvgg/repvgg-D2se_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f532dcd79686a119e1bed528a1e7c36195e70857
--- /dev/null
+++ b/configs/repvgg/repvgg-D2se_8xb32_in1k.py
@@ -0,0 +1,28 @@
+_base_ = './repvgg-B3_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='D2se'), head=dict(in_channels=2560))
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1.0e-6,
+ by_epoch=True,
+ begin=5,
+ end=300)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=300)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
diff --git a/configs/res2net/README.md b/configs/res2net/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..68b1acce79c18d994d2e310392a75a4b74db6078
--- /dev/null
+++ b/configs/res2net/README.md
@@ -0,0 +1,78 @@
+# Res2Net
+
+> [Res2Net: A New Multi-scale Backbone Architecture](https://arxiv.org/abs/1904.01169)
+
+
+
+## Abstract
+
+Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layer-wise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('res2net50-w14-s8_3rdparty_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('res2net50-w14-s8_3rdparty_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
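+
+If you need the multi-scale feature maps from intermediate stages, the snippet below is a sketch that builds the backbone directly; the constructor arguments (`depth`, `scales`, `base_width`, `out_indices`) are assumed to follow the ResNet-style backbone API:
+
+```python
+import torch
+
+from mmpretrain.models import Res2Net
+
+# w14-s8 variant: base width 14, 8 scales; output all four stages.
+backbone = Res2Net(depth=50, scales=8, base_width=14, out_indices=(0, 1, 2, 3))
+backbone.eval()
+inputs = torch.rand(1, 3, 224, 224)
+with torch.no_grad():
+    feats = backbone(inputs)
+for feat in feats:
+    # One feature map per requested stage, from stride 4 to stride 32.
+    print(feat.shape)
+```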
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/res2net/res2net50-w14-s8_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w14-s8_3rdparty_8xb32_in1k_20210927-bc967bf1.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :-------------------------------------------------------------------: |
+| `res2net50-w14-s8_3rdparty_8xb32_in1k`\* | From scratch | 25.06 | 4.22 | 78.14 | 93.85 | [config](res2net50-w14-s8_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w14-s8_3rdparty_8xb32_in1k_20210927-bc967bf1.pth) |
+| `res2net50-w26-s8_3rdparty_8xb32_in1k`\* | From scratch | 48.40 | 8.39 | 79.20 | 94.36 | [config](res2net50-w26-s8_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w26-s8_3rdparty_8xb32_in1k_20210927-f547a94b.pth) |
+| `res2net101-w26-s4_3rdparty_8xb32_in1k`\* | From scratch | 45.21 | 8.12 | 79.19 | 94.44 | [config](res2net101-w26-s4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/res2net/res2net101-w26-s4_3rdparty_8xb32_in1k_20210927-870b6c36.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net.py#L181). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{gao2019res2net,
+ title={Res2Net: A New Multi-scale Backbone Architecture},
+ author={Gao, Shang-Hua and Cheng, Ming-Ming and Zhao, Kai and Zhang, Xin-Yu and Yang, Ming-Hsuan and Torr, Philip},
+ journal={IEEE TPAMI},
+ year={2021},
+ doi={10.1109/TPAMI.2019.2938758},
+}
+```
diff --git a/configs/res2net/metafile.yml b/configs/res2net/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..b19b102f443998a335362a43b0deb57e0bc264a5
--- /dev/null
+++ b/configs/res2net/metafile.yml
@@ -0,0 +1,70 @@
+Collections:
+ - Name: Res2Net
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Architecture:
+ - Batch Normalization
+ - Convolution
+ - Global Average Pooling
+ - ReLU
+ - Res2Net Block
+ Paper:
+ Title: 'Res2Net: A New Multi-scale Backbone Architecture'
+ URL: https://arxiv.org/abs/1904.01169
+ README: configs/res2net/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.17.0/mmcls/models/backbones/res2net.py
+ Version: v0.17.0
+
+Models:
+ - Name: res2net50-w14-s8_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 4220000000
+ Parameters: 25060000
+ In Collection: Res2Net
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.14
+ Top 5 Accuracy: 93.85
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w14-s8_3rdparty_8xb32_in1k_20210927-bc967bf1.pth
+ Converted From:
+ Weights: https://1drv.ms/u/s!AkxDDnOtroRPdOTqhF8ne_aakDI?e=EVb8Ri
+ Code: https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net.py#L221
+ Config: configs/res2net/res2net50-w14-s8_8xb32_in1k.py
+ - Name: res2net50-w26-s8_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 8390000000
+ Parameters: 48400000
+ In Collection: Res2Net
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.20
+ Top 5 Accuracy: 94.36
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w26-s8_3rdparty_8xb32_in1k_20210927-f547a94b.pth
+ Converted From:
+ Weights: https://1drv.ms/u/s!AkxDDnOtroRPdTrAd_Afzc26Z7Q?e=slYqsR
+ Code: https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net.py#L201
+ Config: configs/res2net/res2net50-w26-s8_8xb32_in1k.py
+ - Name: res2net101-w26-s4_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 8120000000
+ Parameters: 45210000
+ In Collection: Res2Net
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.19
+ Top 5 Accuracy: 94.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/res2net/res2net101-w26-s4_3rdparty_8xb32_in1k_20210927-870b6c36.pth
+ Converted From:
+ Weights: https://1drv.ms/u/s!AkxDDnOtroRPcJRgTLkahL0cFYw?e=nwbnic
+ Code: https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net.py#L181
+ Config: configs/res2net/res2net101-w26-s4_8xb32_in1k.py
diff --git a/configs/res2net/res2net101-w26-s4_8xb32_in1k.py b/configs/res2net/res2net101-w26-s4_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7ebe9e94d64a305a06dda71c3c20d8c6c77cfc06
--- /dev/null
+++ b/configs/res2net/res2net101-w26-s4_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/res2net101-w26-s4.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/res2net/res2net50-w14-s8_8xb32_in1k.py b/configs/res2net/res2net50-w14-s8_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..56cc02e3b893e4976940badabfa577db471620bc
--- /dev/null
+++ b/configs/res2net/res2net50-w14-s8_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/res2net50-w14-s8.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/res2net/res2net50-w26-s8_8xb32_in1k.py b/configs/res2net/res2net50-w26-s8_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d7dcbeb9164875b21aa782ac5bed5f4618a4363e
--- /dev/null
+++ b/configs/res2net/res2net50-w26-s8_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/res2net50-w26-s8.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnest/README.md b/configs/resnest/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..eb6c5fd728c3032b6b8c429100f1399b8803b765
--- /dev/null
+++ b/configs/resnest/README.md
@@ -0,0 +1,26 @@
+# ResNeSt
+
+> [ResNeSt: Split-Attention Networks](https://arxiv.org/abs/2004.08955)
+
+
+
+## Abstract
+
+It is well known that featuremap attention and multi-path representation are important for visual recognition. In this paper, we present a modularized architecture, which applies the channel-wise attention on different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations. Our design results in a simple and unified computation block, which can be parameterized using only a few variables. Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification. In addition, ResNeSt has achieved superior transfer learning results on several public benchmarks serving as the backbone, and has been adopted by the winning entries of COCO-LVIS challenge. The source code for complete system and pretrained models are publicly available.
+
+
+

+
+
+## Citation
+
+```bibtex
+@misc{zhang2020resnest,
+ title={ResNeSt: Split-Attention Networks},
+ author={Hang Zhang and Chongruo Wu and Zhongyue Zhang and Yi Zhu and Haibin Lin and Zhi Zhang and Yue Sun and Tong He and Jonas Mueller and R. Manmatha and Mu Li and Alexander Smola},
+ year={2020},
+ eprint={2004.08955},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/resnest/_randaug_policies.py b/configs/resnest/_randaug_policies.py
new file mode 100644
index 0000000000000000000000000000000000000000..d650caa2f586045ab76102a5506885e6da2fb4ed
--- /dev/null
+++ b/configs/resnest/_randaug_policies.py
@@ -0,0 +1,92 @@
+policies = [
+ dict(type='AutoContrast', prob=0.5),
+ dict(type='Equalize', prob=0.5),
+ dict(type='Invert', prob=0.5),
+ dict(
+ type='Rotate',
+ magnitude_key='angle',
+ magnitude_range=(0, 30),
+ pad_val=0,
+ prob=0.5,
+ random_negative_prob=0.5),
+ dict(
+ type='Posterize',
+ magnitude_key='bits',
+ magnitude_range=(0, 4),
+ prob=0.5),
+ dict(
+ type='Solarize',
+ magnitude_key='thr',
+ magnitude_range=(0, 256),
+ prob=0.5),
+ dict(
+ type='SolarizeAdd',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 110),
+ thr=128,
+ prob=0.5),
+ dict(
+ type='ColorTransform',
+ magnitude_key='magnitude',
+ magnitude_range=(-0.9, 0.9),
+ prob=0.5,
+ random_negative_prob=0.),
+ dict(
+ type='Contrast',
+ magnitude_key='magnitude',
+ magnitude_range=(-0.9, 0.9),
+ prob=0.5,
+ random_negative_prob=0.),
+ dict(
+ type='Brightness',
+ magnitude_key='magnitude',
+ magnitude_range=(-0.9, 0.9),
+ prob=0.5,
+ random_negative_prob=0.),
+ dict(
+ type='Sharpness',
+ magnitude_key='magnitude',
+ magnitude_range=(-0.9, 0.9),
+ prob=0.5,
+ random_negative_prob=0.),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ pad_val=0,
+ prob=0.5,
+ direction='horizontal',
+ random_negative_prob=0.5),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ pad_val=0,
+ prob=0.5,
+ direction='vertical',
+ random_negative_prob=0.5),
+ dict(
+ type='Cutout',
+ magnitude_key='shape',
+ magnitude_range=(1, 41),
+ pad_val=0,
+ prob=0.5),
+ dict(
+ type='Translate',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ pad_val=0,
+ prob=0.5,
+ direction='horizontal',
+ random_negative_prob=0.5,
+ interpolation='bicubic'),
+ dict(
+ type='Translate',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ pad_val=0,
+ prob=0.5,
+ direction='vertical',
+ random_negative_prob=0.5,
+ interpolation='bicubic')
+]
diff --git a/configs/resnest/resnest101_32xb64_in1k.py b/configs/resnest/resnest101_32xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac78659147a6fd1a56a89f56ed552ef3736488c4
--- /dev/null
+++ b/configs/resnest/resnest101_32xb64_in1k.py
@@ -0,0 +1,78 @@
+_base_ = [
+ '../_base_/models/resnest101.py',
+ '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/default_runtime.py',
+ './_randaug_policies.py',
+]
+
+# dataset settings
+
+# lighting params, in order of BGR
+EIGVAL = [55.4625, 4.7940, 1.1475]
+EIGVEC = [
+ [-0.5836, -0.6948, 0.4203],
+ [-0.5808, -0.0045, -0.8140],
+ [-0.5675, 0.7192, 0.4009],
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandAugment',
+ policies={{_base_.policies}},
+ num_policies=2,
+ magnitude_level=12),
+ dict(type='EfficientNetRandomCrop', scale=256, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='Lighting',
+ eigval=EIGVAL,
+ eigvec=EIGVEC,
+ alphastd=0.1,
+ to_rgb=False),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=256, backend='pillow'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=1e-4),
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=265,
+ by_epoch=True,
+ begin=5,
+ end=270,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=270)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/resnest/resnest200_64xb32_in1k.py b/configs/resnest/resnest200_64xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e3b9fb3d7dad8357829a820286f27ef0097426b6
--- /dev/null
+++ b/configs/resnest/resnest200_64xb32_in1k.py
@@ -0,0 +1,74 @@
+_base_ = [
+ '../_base_/models/resnest200.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/default_runtime.py',
+ './_randaug_policies.py',
+]
+
+# dataset settings
+
+# lighting params, in order of BGR
+EIGVAL = [55.4625, 4.7940, 1.1475]
+EIGVEC = [
+ [-0.5836, -0.6948, 0.4203],
+ [-0.5808, -0.0045, -0.8140],
+ [-0.5675, 0.7192, 0.4009],
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandAugment',
+ policies={{_base_.policies}},
+ num_policies=2,
+ magnitude_level=12),
+ dict(type='EfficientNetRandomCrop', scale=320, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='Lighting',
+ eigval=EIGVAL,
+ eigvec=EIGVEC,
+ alphastd=0.1,
+ to_rgb=False),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=320, backend='pillow'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=1e-4),
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=265,
+ by_epoch=True,
+ begin=5,
+ end=270,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=270)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/resnest/resnest269_64xb32_in1k.py b/configs/resnest/resnest269_64xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0e884d63586f8210143ca0bf1e9cf33b2449a4f9
--- /dev/null
+++ b/configs/resnest/resnest269_64xb32_in1k.py
@@ -0,0 +1,78 @@
+_base_ = [
+ '../_base_/models/resnest269.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/default_runtime.py',
+ './_randaug_policies.py',
+]
+
+# dataset settings
+
+# lighting params, in order of BGR
+EIGVAL = [55.4625, 4.7940, 1.1475]
+EIGVEC = [
+ [-0.5836, -0.6948, 0.4203],
+ [-0.5808, -0.0045, -0.8140],
+ [-0.5675, 0.7192, 0.4009],
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandAugment',
+ policies={{_base_.policies}},
+ num_policies=2,
+ magnitude_level=12),
+ dict(type='EfficientNetRandomCrop', scale=416, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='Lighting',
+ eigval=EIGVAL,
+ eigvec=EIGVEC,
+ alphastd=0.1,
+ to_rgb=False),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=416, backend='pillow'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=1e-4),
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=265,
+ by_epoch=True,
+ begin=5,
+ end=270,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=270)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/resnest/resnest50_32xb64_in1k.py b/configs/resnest/resnest50_32xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..05f839b38b669a3093a8a7df7f78f135b88e6b77
--- /dev/null
+++ b/configs/resnest/resnest50_32xb64_in1k.py
@@ -0,0 +1,78 @@
+_base_ = [
+ '../_base_/models/resnest50.py',
+ '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/default_runtime.py',
+ './_randaug_policies.py',
+]
+
+# dataset settings
+
+# lighting params, in order of BGR
+EIGVAL = [55.4625, 4.7940, 1.1475]
+EIGVEC = [
+ [-0.5836, -0.6948, 0.4203],
+ [-0.5808, -0.0045, -0.8140],
+ [-0.5675, 0.7192, 0.4009],
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandAugment',
+ policies={{_base_.policies}},
+ num_policies=2,
+ magnitude_level=12),
+ dict(type='EfficientNetRandomCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='Lighting',
+ eigval=EIGVAL,
+ eigvec=EIGVEC,
+ alphastd=0.1,
+ to_rgb=False),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=256, backend='pillow'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=1e-4),
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=265,
+ by_epoch=True,
+ begin=5,
+ end=270,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=270)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/resnet/README.md b/configs/resnet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..286b77381a57401607cc52568d1d81b8ba5b4d83
--- /dev/null
+++ b/configs/resnet/README.md
@@ -0,0 +1,140 @@
+# ResNet
+
+> [Deep Residual Learning for Image Recognition](https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)
+
+
+
+## Introduction
+
+**Residual Networks**, or **ResNets**, learn residual functions with reference to the layer inputs, instead of
+learning unreferenced functions. In earlier mainstream architectures such as VGG, the network is a plain stack
+of layers and every layer attempts to fit a desired underlying mapping. In ResNets, a few stacked layers are
+grouped into a block, and the layers in a block attempt to learn a residual mapping.
+
+Formally, denoting the desired underlying mapping of a block as $\mathcal{H}(x)$, the mapping is split
+into the sum of the identity and a residual mapping, $\mathcal{H}(x) = x + \mathcal{F}(x)$, and the
+stacked non-linear layers are left to fit the residual mapping $\mathcal{F}(x)$.
+
+Many works have shown that this method makes deep neural networks easier to optimize and lets them gain accuracy
+from considerably increased depth. The residual structure is now widely used in a variety of models.
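+
+As a minimal sketch of this formulation, a residual block can be written in a few lines of PyTorch. This is
+only an illustration of $\mathcal{H}(x) = x + \mathcal{F}(x)$ with a placeholder channel size, not the actual
+block implementation used by the backbones in this repository (which also handles downsampling and channel
+expansion):
+
+```python
+import torch
+import torch.nn as nn
+
+
+class ResidualBlock(nn.Module):
+    """Toy block computing H(x) = x + F(x)."""
+
+    def __init__(self, channels):
+        super().__init__()
+        # F(x): the stacked non-linear layers that learn the residual mapping.
+        self.residual = nn.Sequential(
+            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
+            nn.BatchNorm2d(channels),
+            nn.ReLU(inplace=True),
+            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
+            nn.BatchNorm2d(channels),
+        )
+
+    def forward(self, x):
+        # H(x) = x + F(x): identity shortcut plus the learned residual.
+        return torch.relu(x + self.residual(x))
+
+
+block = ResidualBlock(64)
+out = block(torch.rand(1, 64, 56, 56))  # output has the same shape as the input
+```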
+
+
+

+
+
+## Abstract
+
+
+
+
+
+Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
+
+The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet18_8xb16_cifar10', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('resnet18_8xb16_cifar10', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/resnet/resnet18_8xb16_cifar10.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/resnet/resnet18_8xb16_cifar10.py https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :--------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-------------------------------------------: | :----------------------------------------------------------------------: |
+| `resnet18_8xb32_in1k` | From scratch | 11.69 | 1.82 | 69.90 | 89.43 | [config](resnet18_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_8xb32_in1k_20210831-fbbb1da6.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_8xb32_in1k_20210831-fbbb1da6.json) |
+| `resnet34_8xb32_in1k`               | From scratch | 21.80      | 3.68      | 73.62     | 91.59     | [config](resnet34_8xb32_in1k.py)               | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_8xb32_in1k_20210831-f257d4e6.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_8xb32_in1k_20210831-f257d4e6.json) |
+| `resnet50_8xb32_in1k` | From scratch | 25.56 | 4.12 | 76.55 | 93.06 | [config](resnet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.json) |
+| `resnet101_8xb32_in1k` | From scratch | 44.55 | 7.85 | 77.97 | 94.06 | [config](resnet101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_8xb32_in1k_20210831-539c63f8.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_8xb32_in1k_20210831-539c63f8.json) |
+| `resnet152_8xb32_in1k` | From scratch | 60.19 | 11.58 | 78.48 | 94.13 | [config](resnet152_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_8xb32_in1k_20210901-4d7582fa.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_8xb32_in1k_20210901-4d7582fa.json) |
+| `resnetv1d50_8xb32_in1k` | From scratch | 25.58 | 4.36 | 77.54 | 93.57 | [config](resnetv1d50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.json) |
+| `resnetv1d101_8xb32_in1k` | From scratch | 44.57 | 8.09 | 78.93 | 94.48 | [config](resnetv1d101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.json) |
+| `resnetv1d152_8xb32_in1k` | From scratch | 60.21 | 11.82 | 79.41 | 94.70 | [config](resnetv1d152_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.json) |
+| `resnet50_8xb32-fp16_in1k` | From scratch | 25.56 | 4.12 | 76.30 | 93.07 | [config](resnet50_8xb32-fp16_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/fp16/resnet50_batch256_fp16_imagenet_20210320-b3964210.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/fp16/resnet50_batch256_fp16_imagenet_20210320-b3964210.json) |
+| `resnet50_8xb256-rsb-a1-600e_in1k` | From scratch | 25.56 | 4.12 | 80.12 | 94.78 | [config](resnet50_8xb256-rsb-a1-600e_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a1-600e_in1k_20211228-20e21305.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a1-600e_in1k_20211228-20e21305.json) |
+| `resnet50_8xb256-rsb-a2-300e_in1k` | From scratch | 25.56 | 4.12 | 79.55 | 94.37 | [config](resnet50_8xb256-rsb-a2-300e_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a2-300e_in1k_20211228-0fd8be6e.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a2-300e_in1k_20211228-0fd8be6e.json) |
+| `resnet50_8xb256-rsb-a3-100e_in1k` | From scratch | 25.56 | 4.12 | 78.30 | 93.80 | [config](resnet50_8xb256-rsb-a3-100e_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a3-100e_in1k_20211228-3493673c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a3-100e_in1k_20211228-3493673c.json) |
+| `resnetv1c50_8xb32_in1k` | From scratch | 25.58 | 4.36 | 77.01 | 93.58 | [config](resnetv1c50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c50_8xb32_in1k_20220214-3343eccd.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c50_8xb32_in1k_20220214-3343eccd.json) |
+| `resnetv1c101_8xb32_in1k` | From scratch | 44.57 | 8.09 | 78.30 | 94.27 | [config](resnetv1c101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c101_8xb32_in1k_20220214-434fe45f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c101_8xb32_in1k_20220214-434fe45f.json) |
+| `resnetv1c152_8xb32_in1k` | From scratch | 60.21 | 11.82 | 78.76 | 94.41 | [config](resnetv1c152_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c152_8xb32_in1k_20220214-c013291f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c152_8xb32_in1k_20220214-c013291f.json) |
+
+### Image Classification on CIFAR-10
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :------------------------ | :----------: | :--------: | :-------: | :-------: | :----------------------------------: | :-------------------------------------------------------------------------------------------------: |
+| `resnet18_8xb16_cifar10` | From scratch | 11.17 | 0.56 | 94.82 | [config](resnet18_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.json) |
+| `resnet34_8xb16_cifar10` | From scratch | 21.28 | 1.16 | 95.34 | [config](resnet34_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_b16x8_cifar10_20210528-a8aa36a6.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_b16x8_cifar10_20210528-a8aa36a6.json) |
+| `resnet50_8xb16_cifar10` | From scratch | 23.52 | 1.31 | 95.55 | [config](resnet50_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.json) |
+| `resnet101_8xb16_cifar10` | From scratch | 42.51 | 2.52 | 95.58 | [config](resnet101_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_b16x8_cifar10_20210528-2d29e936.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_b16x8_cifar10_20210528-2d29e936.json) |
+| `resnet152_8xb16_cifar10` | From scratch | 58.16 | 3.74 | 95.76 | [config](resnet152_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_b16x8_cifar10_20210528-3e8e9178.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_b16x8_cifar10_20210528-3e8e9178.json) |
+
+### Image Classification on CIFAR-100
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: |
+| `resnet50_8xb16_cifar100` | From scratch | 23.71 | 1.31 | 79.90 | 95.19 | [config](resnet50_8xb16_cifar100.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar100_20210528-67b58a1b.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar100_20210528-67b58a1b.json) |
+
+### Image Classification on CUB-200-2011
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :------------------ | :----------: | :--------: | :-------: | :-------: | :----------------------------: | :-------------------------------------------------------------------------------------------------------------: |
+| `resnet50_8xb8_cub` | From scratch | 23.92 | 16.48 | 88.45 | [config](resnet50_8xb8_cub.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb8_cub_20220307-57840e60.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb8_cub_20220307-57840e60.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{he2016deep,
+ title={Deep residual learning for image recognition},
+ author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
+ booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
+ pages={770--778},
+ year={2016}
+}
+```
diff --git a/configs/resnet/metafile.yml b/configs/resnet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..16387248c43aea59c5563b4c6c98df8dd8effead
--- /dev/null
+++ b/configs/resnet/metafile.yml
@@ -0,0 +1,352 @@
+Collections:
+ - Name: ResNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 100
+ Batch Size: 256
+ Architecture:
+ - ResNet
+ Paper:
+ URL: https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html
+ Title: "Deep Residual Learning for Image Recognition"
+ README: configs/resnet/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/resnet.py#L383
+ Version: v0.15.0
+
+Models:
+ - Name: resnet18_8xb16_cifar10
+ Metadata:
+ Training Data: CIFAR-10
+ Epochs: 200
+ Batch Size: 128
+ FLOPs: 560000000
+ Parameters: 11170000
+ In Collection: ResNet
+ Results:
+ - Dataset: CIFAR-10
+ Metrics:
+ Top 1 Accuracy: 94.82
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth
+ Config: configs/resnet/resnet18_8xb16_cifar10.py
+ - Name: resnet34_8xb16_cifar10
+ Metadata:
+ Training Data: CIFAR-10
+ Epochs: 200
+ Batch Size: 128
+ FLOPs: 1160000000
+ Parameters: 21280000
+ In Collection: ResNet
+ Results:
+ - Dataset: CIFAR-10
+ Metrics:
+ Top 1 Accuracy: 95.34
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_b16x8_cifar10_20210528-a8aa36a6.pth
+ Config: configs/resnet/resnet34_8xb16_cifar10.py
+ - Name: resnet50_8xb16_cifar10
+ Metadata:
+ Training Data: CIFAR-10
+ Epochs: 200
+ Batch Size: 128
+ FLOPs: 1310000000
+ Parameters: 23520000
+ In Collection: ResNet
+ Results:
+ - Dataset: CIFAR-10
+ Metrics:
+ Top 1 Accuracy: 95.55
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth
+ Config: configs/resnet/resnet50_8xb16_cifar10.py
+ - Name: resnet101_8xb16_cifar10
+ Metadata:
+ Training Data: CIFAR-10
+ Epochs: 200
+ Batch Size: 128
+ FLOPs: 2520000000
+ Parameters: 42510000
+ In Collection: ResNet
+ Results:
+ - Dataset: CIFAR-10
+ Metrics:
+ Top 1 Accuracy: 95.58
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_b16x8_cifar10_20210528-2d29e936.pth
+ Config: configs/resnet/resnet101_8xb16_cifar10.py
+ - Name: resnet152_8xb16_cifar10
+ Metadata:
+ Training Data: CIFAR-10
+ Epochs: 200
+ Batch Size: 128
+ FLOPs: 3740000000
+ Parameters: 58160000
+ In Collection: ResNet
+ Results:
+ - Dataset: CIFAR-10
+ Metrics:
+ Top 1 Accuracy: 95.76
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_b16x8_cifar10_20210528-3e8e9178.pth
+ Config: configs/resnet/resnet152_8xb16_cifar10.py
+ - Name: resnet50_8xb16_cifar100
+ Metadata:
+ Training Data: CIFAR-100
+ Epochs: 200
+ Batch Size: 128
+ FLOPs: 1310000000
+ Parameters: 23710000
+ In Collection: ResNet
+ Results:
+ - Dataset: CIFAR-100
+ Metrics:
+ Top 1 Accuracy: 79.90
+ Top 5 Accuracy: 95.19
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar100_20210528-67b58a1b.pth
+ Config: configs/resnet/resnet50_8xb16_cifar100.py
+ - Name: resnet18_8xb32_in1k
+ Metadata:
+ FLOPs: 1820000000
+ Parameters: 11690000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.90
+ Top 5 Accuracy: 89.43
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_8xb32_in1k_20210831-fbbb1da6.pth
+ Config: configs/resnet/resnet18_8xb32_in1k.py
+ - Name: resnet34_8xb32_in1k
+ Metadata:
+ FLOPs: 3680000000
+ Parameters: 21800000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 73.62
+ Top 5 Accuracy: 91.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_8xb32_in1k_20210831-f257d4e6.pth
+ Config: configs/resnet/resnet34_8xb32_in1k.py
+ - Name: resnet50_8xb32_in1k
+ Metadata:
+ FLOPs: 4120000000
+ Parameters: 25560000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.55
+ Top 5 Accuracy: 93.06
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth
+ Config: configs/resnet/resnet50_8xb32_in1k.py
+ - Name: resnet101_8xb32_in1k
+ Metadata:
+ FLOPs: 7850000000
+ Parameters: 44550000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.97
+ Top 5 Accuracy: 94.06
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_8xb32_in1k_20210831-539c63f8.pth
+ Config: configs/resnet/resnet101_8xb32_in1k.py
+ - Name: resnet152_8xb32_in1k
+ Metadata:
+ FLOPs: 11580000000
+ Parameters: 60190000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.48
+ Top 5 Accuracy: 94.13
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_8xb32_in1k_20210901-4d7582fa.pth
+ Config: configs/resnet/resnet152_8xb32_in1k.py
+ - Name: resnetv1d50_8xb32_in1k
+ Metadata:
+ FLOPs: 4360000000
+ Parameters: 25580000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.54
+ Top 5 Accuracy: 93.57
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.pth
+ Config: configs/resnet/resnetv1d50_8xb32_in1k.py
+ - Name: resnetv1d101_8xb32_in1k
+ Metadata:
+ FLOPs: 8090000000
+ Parameters: 44570000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.93
+ Top 5 Accuracy: 94.48
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.pth
+ Config: configs/resnet/resnetv1d101_8xb32_in1k.py
+ - Name: resnetv1d152_8xb32_in1k
+ Metadata:
+ FLOPs: 11820000000
+ Parameters: 60210000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.41
+ Top 5 Accuracy: 94.70
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.pth
+ Config: configs/resnet/resnetv1d152_8xb32_in1k.py
+ - Name: resnet50_8xb32-fp16_in1k
+ Metadata:
+ FLOPs: 4120000000
+ Parameters: 25560000
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ - Mixed Precision Training
+ In Collection: ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.30
+ Top 5 Accuracy: 93.07
+ Weights: https://download.openmmlab.com/mmclassification/v0/fp16/resnet50_batch256_fp16_imagenet_20210320-b3964210.pth
+ Config: configs/resnet/resnet50_8xb32-fp16_in1k.py
+ - Name: resnet50_8xb256-rsb-a1-600e_in1k
+ Metadata:
+ FLOPs: 4120000000
+ Parameters: 25560000
+ Training Techniques:
+ - LAMB
+ - Weight Decay
+ - Cosine Annealing
+ - Mixup
+ - CutMix
+ - RepeatAugSampler
+ - RandAugment
+ Epochs: 600
+ Batch Size: 2048
+ In Collection: ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.12
+ Top 5 Accuracy: 94.78
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a1-600e_in1k_20211228-20e21305.pth
+ Config: configs/resnet/resnet50_8xb256-rsb-a1-600e_in1k.py
+ - Name: resnet50_8xb256-rsb-a2-300e_in1k
+ Metadata:
+ FLOPs: 4120000000
+ Parameters: 25560000
+ Training Techniques:
+ - LAMB
+ - Weight Decay
+ - Cosine Annealing
+ - Mixup
+ - CutMix
+ - RepeatAugSampler
+ - RandAugment
+ Epochs: 300
+ Batch Size: 2048
+ In Collection: ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.55
+ Top 5 Accuracy: 94.37
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a2-300e_in1k_20211228-0fd8be6e.pth
+ Config: configs/resnet/resnet50_8xb256-rsb-a2-300e_in1k.py
+ - Name: resnet50_8xb256-rsb-a3-100e_in1k
+ Metadata:
+ FLOPs: 4120000000
+ Parameters: 25560000
+ Training Techniques:
+ - LAMB
+ - Weight Decay
+ - Cosine Annealing
+ - Mixup
+ - CutMix
+ - RandAugment
+ Batch Size: 2048
+ In Collection: ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.30
+ Top 5 Accuracy: 93.80
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a3-100e_in1k_20211228-3493673c.pth
+ Config: configs/resnet/resnet50_8xb256-rsb-a3-100e_in1k.py
+ - Name: resnetv1c50_8xb32_in1k
+ Metadata:
+ FLOPs: 4360000000
+ Parameters: 25580000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.01
+ Top 5 Accuracy: 93.58
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c50_8xb32_in1k_20220214-3343eccd.pth
+ Config: configs/resnet/resnetv1c50_8xb32_in1k.py
+ - Name: resnetv1c101_8xb32_in1k
+ Metadata:
+ FLOPs: 8090000000
+ Parameters: 44570000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.30
+ Top 5 Accuracy: 94.27
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c101_8xb32_in1k_20220214-434fe45f.pth
+ Config: configs/resnet/resnetv1c101_8xb32_in1k.py
+ - Name: resnetv1c152_8xb32_in1k
+ Metadata:
+ FLOPs: 11820000000
+ Parameters: 60210000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.76
+ Top 5 Accuracy: 94.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c152_8xb32_in1k_20220214-c013291f.pth
+ Config: configs/resnet/resnetv1c152_8xb32_in1k.py
+ - Name: resnet50_8xb8_cub
+ Metadata:
+ FLOPs: 16480000000
+ Parameters: 23920000
+ In Collection: ResNet
+ Results:
+ - Dataset: CUB-200-2011
+ Metrics:
+ Top 1 Accuracy: 88.45
+ Task: Image Classification
+ Pretrain: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb8_cub_20220307-57840e60.pth
+ Config: configs/resnet/resnet50_8xb8_cub.py
diff --git a/configs/resnet/resnet101_8xb16_cifar10.py b/configs/resnet/resnet101_8xb16_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..166a1740b09c5fb74462a0672cd5fef54caae8f7
--- /dev/null
+++ b/configs/resnet/resnet101_8xb16_cifar10.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet101_cifar.py',
+ '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet101_8xb32_in1k.py b/configs/resnet/resnet101_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..388d2cd918ab75ec46346faa0448ef9cf2893fc8
--- /dev/null
+++ b/configs/resnet/resnet101_8xb32_in1k.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet101.py', '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet152_8xb16_cifar10.py b/configs/resnet/resnet152_8xb16_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..3f307b6aa81661558b8308094de6e8327d08c830
--- /dev/null
+++ b/configs/resnet/resnet152_8xb16_cifar10.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet152_cifar.py',
+ '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet152_8xb32_in1k.py b/configs/resnet/resnet152_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cc9dc2cee4a0fd8a9d47d461b2d5d00bf9962bf5
--- /dev/null
+++ b/configs/resnet/resnet152_8xb32_in1k.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet152.py', '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet18_8xb16_cifar10.py b/configs/resnet/resnet18_8xb16_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..c7afa397b7b6a01decd0a010816ebe3678ca44aa
--- /dev/null
+++ b/configs/resnet/resnet18_8xb16_cifar10.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet18_cifar.py', '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet18_8xb32_in1k.py b/configs/resnet/resnet18_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac452ff75602464eba84a3eea150b30748122c69
--- /dev/null
+++ b/configs/resnet/resnet18_8xb32_in1k.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet18.py', '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet34_8xb16_cifar10.py b/configs/resnet/resnet34_8xb16_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..7f5cd517d505ea479b506b6e4756c117c392dabd
--- /dev/null
+++ b/configs/resnet/resnet34_8xb16_cifar10.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet34_cifar.py', '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet34_8xb32_in1k.py b/configs/resnet/resnet34_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7749261c80defef7cbf94c4e1284c26382246dc6
--- /dev/null
+++ b/configs/resnet/resnet34_8xb32_in1k.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet34.py', '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_32xb64-warmup-coslr_in1k.py b/configs/resnet/resnet50_32xb64-warmup-coslr_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c26245ef53a736c22c0ef7d4e9d8b7876509fe2e
--- /dev/null
+++ b/configs/resnet/resnet50_32xb64-warmup-coslr_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/schedules/imagenet_bs2048_coslr.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_32xb64-warmup-lbs_in1k.py b/configs/resnet/resnet50_32xb64-warmup-lbs_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2f24f9a0f2c54a2bb634c1f374bc1b534d63697f
--- /dev/null
+++ b/configs/resnet/resnet50_32xb64-warmup-lbs_in1k.py
@@ -0,0 +1,12 @@
+_base_ = ['./resnet50_32xb64-warmup_in1k.py']
+model = dict(
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ loss_weight=1.0,
+ label_smooth_val=0.1,
+ num_classes=1000),
+ ))
diff --git a/configs/resnet/resnet50_32xb64-warmup_in1k.py b/configs/resnet/resnet50_32xb64-warmup_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..34d5288b9d3f9fcf3f0b409dc1c17906654c2170
--- /dev/null
+++ b/configs/resnet/resnet50_32xb64-warmup_in1k.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/schedules/imagenet_bs2048.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb128_coslr-90e_in21k.py b/configs/resnet/resnet50_8xb128_coslr-90e_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d2cc1ee2830661998505310d8c7074d8ae5da6b4
--- /dev/null
+++ b/configs/resnet/resnet50_8xb128_coslr-90e_in21k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/resnet50.py', '../_base_/datasets/imagenet21k_bs128.py',
+ '../_base_/schedules/imagenet_bs1024_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(head=dict(num_classes=21843))
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
diff --git a/configs/resnet/resnet50_8xb16-mixup_cifar10.py b/configs/resnet/resnet50_8xb16-mixup_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..2420ebfeb0a34675a4b1b2a69c0b8a39e197ce35
--- /dev/null
+++ b/configs/resnet/resnet50_8xb16-mixup_cifar10.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet50_cifar_mixup.py',
+ '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb16_cifar10.py b/configs/resnet/resnet50_8xb16_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..669e5de27e526dd46d9f06c99e478dce16f0ac9a
--- /dev/null
+++ b/configs/resnet/resnet50_8xb16_cifar10.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet50_cifar.py', '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb16_cifar100.py b/configs/resnet/resnet50_8xb16_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..ebde6c76ecca6d23b58edfb85ebc3b72ce15a2b2
--- /dev/null
+++ b/configs/resnet/resnet50_8xb16_cifar100.py
@@ -0,0 +1,19 @@
+_base_ = [
+ '../_base_/models/resnet50_cifar.py',
+ '../_base_/datasets/cifar100_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(head=dict(num_classes=100))
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(weight_decay=0.0005))
+
+param_scheduler = dict(
+ type='MultiStepLR',
+ by_epoch=True,
+ milestones=[60, 120, 160],
+ gamma=0.2,
+)
diff --git a/configs/resnet/resnet50_8xb256-rsb-a1-600e_in1k.py b/configs/resnet/resnet50_8xb256-rsb-a1-600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a4ea15984a0063c06e09eb5063d49b2cf90371cf
--- /dev/null
+++ b/configs/resnet/resnet50_8xb256-rsb-a1-600e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/resnet50.py',
+ '../_base_/datasets/imagenet_bs256_rsb_a12.py',
+ '../_base_/schedules/imagenet_bs2048_rsb.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ norm_cfg=dict(type='SyncBN', requires_grad=True),
+ drop_path_rate=0.05,
+ ),
+ head=dict(
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ use_sigmoid=True,
+ )),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.2),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(sampler=dict(type='RepeatAugSampler', shuffle=True))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(weight_decay=0.01),
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=595,
+ eta_min=1.0e-6,
+ by_epoch=True,
+ begin=5,
+ end=600)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=600)
diff --git a/configs/resnet/resnet50_8xb256-rsb-a2-300e_in1k.py b/configs/resnet/resnet50_8xb256-rsb-a2-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..df8edc0370400a3f3985c33bffae2d04afc55772
--- /dev/null
+++ b/configs/resnet/resnet50_8xb256-rsb-a2-300e_in1k.py
@@ -0,0 +1,46 @@
+_base_ = [
+ '../_base_/models/resnet50.py',
+ '../_base_/datasets/imagenet_bs256_rsb_a12.py',
+ '../_base_/schedules/imagenet_bs2048_rsb.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ norm_cfg=dict(type='SyncBN', requires_grad=True),
+ drop_path_rate=0.05,
+ ),
+ head=dict(loss=dict(use_sigmoid=True)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.1),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# dataset settings
+train_dataloader = dict(sampler=dict(type='RepeatAugSampler', shuffle=True))
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.))
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1.0e-6,
+ by_epoch=True,
+ begin=5,
+ end=300)
+]
+train_cfg = dict(by_epoch=True, max_epochs=300)
diff --git a/configs/resnet/resnet50_8xb256-rsb-a3-100e_in1k.py b/configs/resnet/resnet50_8xb256-rsb-a3-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3a36c5843a69aea20fdb9287561e5c2a96459852
--- /dev/null
+++ b/configs/resnet/resnet50_8xb256-rsb-a3-100e_in1k.py
@@ -0,0 +1,22 @@
+_base_ = [
+ '../_base_/models/resnet50.py',
+ '../_base_/datasets/imagenet_bs256_rsb_a3.py',
+ '../_base_/schedules/imagenet_bs2048_rsb.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ backbone=dict(norm_cfg=dict(type='SyncBN', requires_grad=True)),
+ head=dict(loss=dict(use_sigmoid=True)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.1),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=0.008),
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.),
+)
diff --git a/configs/resnet/resnet50_8xb32-coslr-preciseBN_in1k.py b/configs/resnet/resnet50_8xb32-coslr-preciseBN_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..01fefbbf2852eeceddb0ad026fb5098e763e0710
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-coslr-preciseBN_in1k.py
@@ -0,0 +1,13 @@
+_base_ = 'resnet50_8xb32-coslr_in1k.py'
+
+# The PreciseBN hook updates the BN statistics, so it should be executed
+# before CheckpointHook (priority 'VERY_LOW') and EMAHook (priority
+# 'NORMAL'). Therefore, the priority of PreciseBNHook is set to
+# 'ABOVE_NORMAL' here.
+custom_hooks = [
+ dict(
+ type='PreciseBNHook',
+ num_samples=8192,
+ interval=1,
+ priority='ABOVE_NORMAL')
+]
diff --git a/configs/resnet/resnet50_8xb32-coslr_in1k.py b/configs/resnet/resnet50_8xb32-coslr_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..938a114b79696b5ad3442c1dd2a7aea33342b679
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-coslr_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb32-cutmix_in1k.py b/configs/resnet/resnet50_8xb32-cutmix_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2f8d0ca9f3a500344c18b669f25f3cb78393d7dd
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-cutmix_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet50_cutmix.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb32-fp16-dynamic_in1k.py b/configs/resnet/resnet50_8xb32-fp16-dynamic_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..58f6fe4cf25e8f0b3d321a7aab4b746552aa4163
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-fp16-dynamic_in1k.py
@@ -0,0 +1,4 @@
+_base_ = ['./resnet50_8xb32_in1k.py']
+
+# schedule settings
+optim_wrapper = dict(type='AmpOptimWrapper', loss_scale='dynamic')
diff --git a/configs/resnet/resnet50_8xb32-fp16_in1k.py b/configs/resnet/resnet50_8xb32-fp16_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..19ee6ee4f82ec02f34628bdf8dd74a379798cc67
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-fp16_in1k.py
@@ -0,0 +1,4 @@
+_base_ = ['./resnet50_8xb32_in1k.py']
+
+# schedule settings
+optim_wrapper = dict(type='AmpOptimWrapper', loss_scale=512.)
diff --git a/configs/resnet/resnet50_8xb32-lbs_in1k.py b/configs/resnet/resnet50_8xb32-lbs_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1c1aa5a2c4eee10c10159175224d9b77ea57e57b
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-lbs_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet50_label_smooth.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb32-mixup_in1k.py b/configs/resnet/resnet50_8xb32-mixup_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2a153d0e18f521f72b8beaf4cbea36d41f5b3300
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-mixup_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet50_mixup.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb32_in1k.py b/configs/resnet/resnet50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c32f333b67c255c6101469323636bf242eebb8da
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32_in1k.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb8_cub.py b/configs/resnet/resnet50_8xb8_cub.py
new file mode 100644
index 0000000000000000000000000000000000000000..17054ef536930d74136897f8f25637321a364ce7
--- /dev/null
+++ b/configs/resnet/resnet50_8xb8_cub.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../_base_/models/resnet50.py',
+ '../_base_/datasets/cub_bs8_448.py',
+ '../_base_/schedules/cub_bs64.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+# use the pre-trained weights converted from https://github.com/Alibaba-MIIL/ImageNet21K # noqa
+pretrained = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth' # noqa
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ init_cfg=dict(
+ type='Pretrained', checkpoint=pretrained, prefix='backbone')),
+ head=dict(num_classes=200, ))
+
+# runtime settings
+default_hooks = dict(logger=dict(type='LoggerHook', interval=20))
diff --git a/configs/resnet/resnetv1c101_8xb32_in1k.py b/configs/resnet/resnetv1c101_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..441aff591851f402a176c142c93dc866a77b82c2
--- /dev/null
+++ b/configs/resnet/resnetv1c101_8xb32_in1k.py
@@ -0,0 +1,7 @@
+_base_ = [
+ '../_base_/models/resnetv1c50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
+
+model = dict(backbone=dict(depth=101))
diff --git a/configs/resnet/resnetv1c152_8xb32_in1k.py b/configs/resnet/resnetv1c152_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b9f466f85c8e8c89fb78f53c27eca1d5acaf5221
--- /dev/null
+++ b/configs/resnet/resnetv1c152_8xb32_in1k.py
@@ -0,0 +1,7 @@
+_base_ = [
+ '../_base_/models/resnetv1c50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
+
+model = dict(backbone=dict(depth=152))
diff --git a/configs/resnet/resnetv1c50_8xb32_in1k.py b/configs/resnet/resnetv1c50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..aa1c8b6475ce373f4a35123a72e31419b87027c0
--- /dev/null
+++ b/configs/resnet/resnetv1c50_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnetv1c50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnetv1d101_8xb32_in1k.py b/configs/resnet/resnetv1d101_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b16ca863db2c50267764b1b37aa8b2db891ad2c9
--- /dev/null
+++ b/configs/resnet/resnetv1d101_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnetv1d101.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnetv1d152_8xb32_in1k.py b/configs/resnet/resnetv1d152_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..76926ddbb661029b8cff86ad0d98028531235fa1
--- /dev/null
+++ b/configs/resnet/resnetv1d152_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnetv1d152.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnetv1d50_8xb32_in1k.py b/configs/resnet/resnetv1d50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..208bde470ad12407d7e56eddeddfc88529e3708b
--- /dev/null
+++ b/configs/resnet/resnetv1d50_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnetv1d50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnext/README.md b/configs/resnext/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b901b31bd5bd3b99bce07cc2454e4b9a12d40bb2
--- /dev/null
+++ b/configs/resnext/README.md
@@ -0,0 +1,83 @@
+# ResNeXt
+
+> [Aggregated Residual Transformations for Deep Neural Networks](https://openaccess.thecvf.com/content_cvpr_2017/html/Xie_Aggregated_Residual_Transformations_CVPR_2017_paper.html)
+
+
+
+## Abstract
+
+We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call "cardinality" (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.
+
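+As a rough illustration of the aggregated-transformations idea (and not the actual `ResNeXt` backbone in this
+repository), the set of branches with identical topology can be realized as a single grouped convolution,
+where `groups` plays the role of cardinality; the channel sizes below are placeholders:
+
+```python
+import torch
+import torch.nn as nn
+
+# Cardinality = 32: the 3x3 convolution is split into 32 parallel branches
+# with identical topology, implemented here as one grouped convolution.
+cardinality, channels, bottleneck = 32, 256, 128
+block = nn.Sequential(
+    nn.Conv2d(channels, bottleneck, 1, bias=False),
+    nn.BatchNorm2d(bottleneck),
+    nn.ReLU(inplace=True),
+    nn.Conv2d(bottleneck, bottleneck, 3, padding=1, groups=cardinality, bias=False),
+    nn.BatchNorm2d(bottleneck),
+    nn.ReLU(inplace=True),
+    nn.Conv2d(bottleneck, channels, 1, bias=False),
+    nn.BatchNorm2d(channels),
+)
+
+x = torch.rand(1, channels, 56, 56)
+out = torch.relu(x + block(x))  # residual shortcut around the aggregated block
+```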
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnext50-32x4d_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('resnext50-32x4d_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/resnext/resnext50-32x4d_8xb32_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/resnext/resnext50-32x4d_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/resnext/resnext50_32x4d_b32x8_imagenet_20210429-56066e27.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :--------------------------------------------------------------------------------: |
+| `resnext50-32x4d_8xb32_in1k` | From scratch | 25.03 | 4.27 | 77.90 | 93.66 | [config](resnext50-32x4d_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnext/resnext50_32x4d_b32x8_imagenet_20210429-56066e27.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnext/resnext50_32x4d_b32x8_imagenet_20210429-56066e27.json) |
+| `resnext101-32x4d_8xb32_in1k` | From scratch | 44.18 | 8.03 | 78.61 | 94.17 | [config](resnext101-32x4d_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x4d_b32x8_imagenet_20210506-e0fa3dd5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x4d_b32x8_imagenet_20210506-e0fa3dd5.json) |
+| `resnext101-32x8d_8xb32_in1k` | From scratch | 88.79 | 16.50 | 79.27 | 94.58 | [config](resnext101-32x8d_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x8d_b32x8_imagenet_20210506-23a247d5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x8d_b32x8_imagenet_20210506-23a247d5.json) |
+| `resnext152-32x4d_8xb32_in1k` | From scratch | 59.95 | 11.80 | 78.88 | 94.33 | [config](resnext152-32x4d_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnext/resnext152_32x4d_b32x8_imagenet_20210524-927787be.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnext/resnext152_32x4d_b32x8_imagenet_20210524-927787be.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{xie2017aggregated,
+ title={Aggregated residual transformations for deep neural networks},
+ author={Xie, Saining and Girshick, Ross and Doll{\'a}r, Piotr and Tu, Zhuowen and He, Kaiming},
+ booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
+ pages={1492--1500},
+ year={2017}
+}
+```
diff --git a/configs/resnext/metafile.yml b/configs/resnext/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..71283288fd743116c00b14ee1dc1697770b0706c
--- /dev/null
+++ b/configs/resnext/metafile.yml
@@ -0,0 +1,73 @@
+Collections:
+ - Name: ResNeXt
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 100
+ Batch Size: 256
+ Architecture:
+ - ResNeXt
+ Paper:
+ URL: https://openaccess.thecvf.com/content_cvpr_2017/html/Xie_Aggregated_Residual_Transformations_CVPR_2017_paper.html
+ Title: "Aggregated Residual Transformations for Deep Neural Networks"
+ README: configs/resnext/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/resnext.py#L90
+ Version: v0.15.0
+
+Models:
+ - Name: resnext50-32x4d_8xb32_in1k
+ Metadata:
+ FLOPs: 4270000000
+ Parameters: 25030000
+ In Collection: ResNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.90
+ Top 5 Accuracy: 93.66
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnext/resnext50_32x4d_b32x8_imagenet_20210429-56066e27.pth
+ Config: configs/resnext/resnext50-32x4d_8xb32_in1k.py
+ - Name: resnext101-32x4d_8xb32_in1k
+ Metadata:
+ FLOPs: 8030000000
+ Parameters: 44180000
+ In Collection: ResNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.61
+ Top 5 Accuracy: 94.17
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x4d_b32x8_imagenet_20210506-e0fa3dd5.pth
+ Config: configs/resnext/resnext101-32x4d_8xb32_in1k.py
+ - Name: resnext101-32x8d_8xb32_in1k
+ Metadata:
+ FLOPs: 16500000000
+ Parameters: 88790000
+ In Collection: ResNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.27
+ Top 5 Accuracy: 94.58
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x8d_b32x8_imagenet_20210506-23a247d5.pth
+ Config: configs/resnext/resnext101-32x8d_8xb32_in1k.py
+ - Name: resnext152-32x4d_8xb32_in1k
+ Metadata:
+ FLOPs: 11800000000
+ Parameters: 59950000
+ In Collection: ResNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.88
+ Top 5 Accuracy: 94.33
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnext/resnext152_32x4d_b32x8_imagenet_20210524-927787be.pth
+ Config: configs/resnext/resnext152-32x4d_8xb32_in1k.py
diff --git a/configs/resnext/resnext101-32x4d_8xb32_in1k.py b/configs/resnext/resnext101-32x4d_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..970aa60f35fb6b04f72688d5862155575858b1fe
--- /dev/null
+++ b/configs/resnext/resnext101-32x4d_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnext101_32x4d.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnext/resnext101-32x8d_8xb32_in1k.py b/configs/resnext/resnext101-32x8d_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..315d05fd57b34d80ab1590077f98d21b80453209
--- /dev/null
+++ b/configs/resnext/resnext101-32x8d_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnext101_32x8d.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnext/resnext152-32x4d_8xb32_in1k.py b/configs/resnext/resnext152-32x4d_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9c137313cb7f357f8328048ffe833cdc4952cb84
--- /dev/null
+++ b/configs/resnext/resnext152-32x4d_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnext152_32x4d.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnext/resnext50-32x4d_8xb32_in1k.py b/configs/resnext/resnext50-32x4d_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..bd9c9fcf4e6d9941cb87ffc963cc99b39069116c
--- /dev/null
+++ b/configs/resnext/resnext50-32x4d_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnext50_32x4d.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/revvit/README.md b/configs/revvit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..0439b22ac9d196a56016503f210fc73d3baab71d
--- /dev/null
+++ b/configs/revvit/README.md
@@ -0,0 +1,91 @@
+# Reversible Vision Transformers
+
+> [Reversible Vision Transformers](https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf)
+
+
+
+## Introduction
+
+**RevViT** is initially described in [Reversible Vision Transformers](https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf), which introduces the reversible design into vision transformers to reduce the GPU memory footprint required for training.
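+
+The memory saving comes from the reversible residual coupling, in which a block's inputs can be recomputed from its outputs during back-propagation instead of being cached. Below is a minimal sketch of that coupling; plain linear layers stand in for the attention and MLP sub-blocks, and it is not the actual RevViT backbone implementation in this repo.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class ReversibleBlock(nn.Module):
+    """Two-stream reversible block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
+
+    def __init__(self, dim):
+        super().__init__()
+        # Stand-ins for the attention (F) and MLP (G) sub-blocks.
+        self.f = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))
+        self.g = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))
+
+    def forward(self, x1, x2):
+        y1 = x1 + self.f(x2)
+        y2 = x2 + self.g(y1)
+        return y1, y2
+
+    def inverse(self, y1, y2):
+        # Recompute the inputs from the outputs, so intermediate
+        # activations never need to be stored during training.
+        x2 = y2 - self.g(y1)
+        x1 = y1 - self.f(x2)
+        return x1, x2
+
+
+block = ReversibleBlock(64).eval()
+with torch.no_grad():
+    x1, x2 = torch.randn(1, 197, 64), torch.randn(1, 197, 64)
+    y1, y2 = block(x1, x2)
+    r1, r2 = block.inverse(y1, y2)
+    print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))
+# expected: True True
+```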
+
+
+
+
+

+
+
+## Abstract
+
+
+
+
+
+We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory footprint from the depth of the model, Reversible Vision Transformers enable memory efficient scaling of transformer architectures. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark extensively across both model sizes and tasks of image classification, object detection and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5× at identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for resource limited training regimes. Finally, we find that the additional computational burden of recomputing activations is more than overcome for deeper models, where throughput can increase up to 3.9× over their non-reversible counterparts.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('revvit-small_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('revvit-small_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/revvit/revvit-small_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/revvit/revvit-small_3rdparty_in1k_20221213-a3a34f5c.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------: | :----------------------------------------------------------------------------------: |
+| `revvit-small_3rdparty_in1k`\* | From scratch | 22.44 | 4.58 | 79.87 | 94.90 | [config](revvit-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/revvit/revvit-small_3rdparty_in1k_20221213-a3a34f5c.pth) |
+| `revvit-base_3rdparty_in1k`\* | From scratch | 87.34 | 17.49 | 81.81 | 95.56 | [config](revvit-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/revvit/revvit-base_3rdparty_in1k_20221213-87a7b0a5.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/SlowFast). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{mangalam2022reversible,
+ title={Reversible Vision Transformers},
+ author={Mangalam, Karttikeya and Fan, Haoqi and Li, Yanghao and Wu, Chao-Yuan and Xiong, Bo and Feichtenhofer, Christoph and Malik, Jitendra},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ pages={10830--10840},
+ year={2022}
+}
+```
diff --git a/configs/revvit/metafile.yml b/configs/revvit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..842de071f1b15cc9bc65b1ff85d208b6d7131b9d
--- /dev/null
+++ b/configs/revvit/metafile.yml
@@ -0,0 +1,48 @@
+Collections:
+ - Name: RevViT
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Vision Transformer
+ - Reversible
+ Paper:
+ URL: https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf
+ Title: Reversible Vision Transformers
+ README: configs/revvit/README.md
+ Code:
+ Version: v1.0.0rc5
+ URL: https://github.com/open-mmlab/mmpretrain/blob/1.0.0rc5/mmcls/models/backbones/revvit.py
+
+Models:
+ - Name: revvit-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 4583427072
+ Parameters: 22435432
+ In Collection: RevViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.87
+ Top 5 Accuracy: 94.90
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/revvit/revvit-small_3rdparty_in1k_20221213-a3a34f5c.pth
+ Config: configs/revvit/revvit-small_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_VIT_S.pyth
+ Code: https://github.com/facebookresearch/SlowFast
+ - Name: revvit-base_3rdparty_in1k
+ Metadata:
+ FLOPs: 17490450432
+ Parameters: 87337192
+ In Collection: RevViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.81
+ Top 5 Accuracy: 95.56
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/revvit/revvit-base_3rdparty_in1k_20221213-87a7b0a5.pth
+ Config: configs/revvit/revvit-base_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_VIT_B.pyth
+ Code: https://github.com/facebookresearch/SlowFast
diff --git a/configs/revvit/revvit-base_8xb256_in1k.py b/configs/revvit/revvit-base_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e4fde5c9487fb675b75c824608f88ba96f27e9aa
--- /dev/null
+++ b/configs/revvit/revvit-base_8xb256_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/revvit/revvit-base.py',
+ '../_base_/datasets/imagenet_bs128_revvit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_revvit.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/revvit/revvit-small_8xb256_in1k.py b/configs/revvit/revvit-small_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ec3904a3da8164f7f69c61e49d9dfee217a6b99b
--- /dev/null
+++ b/configs/revvit/revvit-small_8xb256_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/revvit/revvit-small.py',
+ '../_base_/datasets/imagenet_bs128_revvit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_revvit.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/riformer/README.md b/configs/riformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6be694d1bf72fd7ba5e5bac0c99d33b9338e0893
--- /dev/null
+++ b/configs/riformer/README.md
@@ -0,0 +1,181 @@
+# RIFormer
+
+> [RIFormer: Keep Your Vision Backbone Effective But Removing Token Mixer](https://arxiv.org/abs/2304.05659)
+
+
+
+## Introduction
+
+RIFormer keeps a vision backbone effective while removing the token mixers from its basic building blocks. It shares nearly the same macro and micro design as MetaFormer, but safely removes all token mixers. Equipped with the proposed optimization strategy, this extremely simple vision backbone achieves encouraging performance with high efficiency during inference: the quantitative results show that it outperforms many prevailing backbones with faster inference speed on ImageNet-1K.
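+
+The inference-time simplification relies on re-parameterization: a training-time per-channel affine that stands in for the token mixer can be folded into the preceding normalization layer, so the extra operator disappears at deploy time. The snippet below is only a hedged sketch of this folding idea with illustrative names; it is not the exact RIFormer block nor the repo's `switch_to_deploy` code.
+
+```python
+import torch
+import torch.nn as nn
+
+torch.manual_seed(0)
+dim = 64
+
+# Training-time form (sketch): a normalization layer followed by a
+# per-channel affine that stands in for the removed token mixer.
+norm = nn.LayerNorm(dim)
+affine = nn.Linear(dim, dim, bias=True)
+
+x = torch.randn(2, 16, dim)
+with torch.no_grad():
+    affine.weight.copy_(torch.diag(torch.randn(dim)))  # per-channel scale only
+    y_train = affine(norm(x))
+
+    # Deploy-time form: fold the affine into the normalization parameters,
+    # so the extra operator disappears and the block is token-mixer free.
+    scale = torch.diagonal(affine.weight)
+    fused = nn.LayerNorm(dim)
+    fused.weight.copy_(norm.weight * scale)
+    fused.bias.copy_(norm.bias * scale + affine.bias)
+    y_deploy = fused(x)
+
+print(torch.allclose(y_train, y_deploy, atol=1e-5))  # True
+```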
+
+
+

+
+
+## Abstract
+
+
+
+
+
+This paper studies how to keep a vision backbone effective while removing token mixers in its basic building blocks. Token mixers, as self-attention for vision transformers (ViTs), are intended to perform information communication between different spatial tokens but suffer from considerable computational cost and latency. However, directly removing them will lead to an incomplete model structure prior, and thus brings a significant accuracy drop. To this end, we first develop an RepIdentityFormer base on the re-parameterizing idea, to study the token mixer free model architecture. And we then explore the improved learning paradigm to break the limitation of simple token mixer free backbone, and summarize the empirical practice into 5 guidelines. Equipped with the proposed optimization strategy, we are able to build an extremely simple vision backbone with encouraging performance, while enjoying the high efficiency during inference. Extensive experiments and ablative analysis also demonstrate that the inductive bias of network architecture, can be incorporated into simple network structure with appropriate optimization strategy. We hope this work can serve as a starting point for the exploration of optimization-driven efficient network design.
+
+
+
+
+## How to use
+
+The provided checkpoints are all `training-time` models. Use the reparameterization tool or the `switch_to_deploy` interface to convert them to the more efficient `inference-time` architecture, which has not only fewer parameters but also fewer computations.
+
+
+
+**Predict image**
+
+Use the `model.backbone.switch_to_deploy()` interface to switch RIFormer models into inference mode.
+
+```python
+>>> import torch
+>>> from mmpretrain import get_model, inference_model
+>>>
+>>> model = get_model("riformer-s12_in1k", pretrained=True)
+>>> results = inference_model(model, 'demo/demo.JPEG')
+>>> print( (results['pred_class'], results['pred_score']) )
+('sea snake', 0.7827484011650085)
+>>>
+>>> # switch to deploy mode
+>>> model.backbone.switch_to_deploy()
+>>> results = inference_model(model, 'demo/demo.JPEG')
+>>> print( (results['pred_class'], results['pred_score']) )
+('sea snake', 0.7827480435371399)
+```
+
+**Use the model**
+
+```python
+>>> import torch
+>>> from mmpretrain import get_model
+>>>
+>>> model = get_model("riformer-s12_in1k", pretrained=True)
+>>> model.eval()
+>>> inputs = torch.rand(1, 3, 224, 224).to(model.data_preprocessor.device)
+>>> # To get classification scores.
+>>> out = model(inputs)
+>>> print(out.shape)
+torch.Size([1, 1000])
+>>> # To extract features.
+>>> outs = model.extract_feat(inputs)
+>>> print(outs[0].shape)
+torch.Size([1, 512])
+>>>
+>>> # switch to deploy mode
+>>> model.backbone.switch_to_deploy()
+>>> out_deploy = model(inputs)
+>>> print(out_deploy.shape)
+torch.Size([1, 1000])
+>>> assert torch.allclose(out, out_deploy, rtol=1e-4, atol=1e-5) # pass without error
+```
+
+**Test Command**
+
+Place the ImageNet dataset in the `data/imagenet/` directory, or prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+*224×224*
+
+Download Checkpoint:
+
+```shell
+wget https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k_20230406-6741ce71.pth
+```
+
+Test with the unfused model:
+
+```shell
+python tools/test.py configs/riformer/riformer-s12_8xb128_in1k.py riformer-s12_32xb128_in1k_20230406-6741ce71.pth
+```
+
+Reparameterize checkpoint:
+
+```shell
+python tools/model_converters/reparameterize_model.py configs/riformer/riformer-s12_8xb128_in1k.py riformer-s12_32xb128_in1k_20230406-6741ce71.pth riformer-s12_deploy.pth
+```
+
+Test with the fused model:
+
+```shell
+python tools/test.py configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py riformer-s12_deploy.pth
+```
+
+
+
+For more configurable parameters, please refer to the [API](https://mmpretrain.readthedocs.io/en/latest/api/generated/mmpretrain.models.backbones.RIFormer.html#mmpretrain.models.backbones.RIFormer).
+
+
+
+**How to use the reparameterization tool**
+
+
+
+Use the provided tool to reparameterize the given model and save the checkpoint:
+
+```bash
+python tools/model_converters/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH}
+```
+
+`${CFG_PATH}` is the config file path, `${SRC_CKPT_PATH}` is the source checkpoint file path, and `${TARGET_CKPT_PATH}` is the target deploy weight file path.
+
+For example:
+
+```shell
+# download the weight
+wget https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k_20230406-6741ce71.pth
+
+# reparameterize unfused weight to fused weight
+python tools/model_converters/reparameterize_model.py configs/riformer/riformer-s12_8xb128_in1k.py riformer-s12_32xb128_in1k_20230406-6741ce71.pth riformer-s12_deploy.pth
+```
+
+To use reparameterized weights, you can use the deploy model config file such as the [s12_deploy example](./deploy/riformer-s12-deploy_8xb128_in1k.py):
+
+```text
+# in riformer-s12-deploy_8xb128_in1k.py
+_base_ = '../riformer-s12_8xb128_in1k.py' # basic s12 config
+
+model = dict(backbone=dict(deploy=True)) # switch model into deploy mode
+```
+
+```shell
+python tools/test.py configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py riformer-s12_deploy.pth
+```
+
+
+
+
+
+## Results and models
+
+### ImageNet-1k
+
+| Model | Resolution | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------: | :--------: | :-------: | :------: | :-------: | :-------: | :-------------------------------------------: | :---------------------------------------------------------------------------------------: |
+| riformer-s12_in1k | 224x224 | 11.92 | 1.82 | 76.90 | 93.06 | [config](./riformer-s12_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k_20230406-6741ce71.pth) |
+| riformer-s24_in1k | 224x224 | 21.39 | 3.41 | 80.28 | 94.80 | [config](./riformer-s24_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s24_32xb128_in1k_20230406-fdab072a.pth) |
+| riformer-s36_in1k | 224x224 | 30.86 | 5.00 | 81.29 | 95.41 | [config](./riformer-s36_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s36_32xb128_in1k_20230406-fdfcd3b0.pth) |
+| riformer-m36_in1k | 224x224 | 56.17 | 8.80 | 82.57 | 95.99 | [config](./riformer-m36_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m36_32xb128_in1k_20230406-2fcb9d9b.pth) |
+| riformer-m48_in1k | 224x224 | 73.47 | 11.59 | 82.75 | 96.11 | [config](./riformer-m48_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m48_32xb128_in1k_20230406-2b9d1abf.pth) |
+| riformer-s12_384_in1k | 384x384 | 11.92 | 5.36 | 78.29 | 93.93 | [config](./riformer-s12_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k-384px_20230406-145eda4c.pth) |
+| riformer-s24_384_in1k | 384x384 | 21.39 | 10.03 | 81.36 | 95.40 | [config](./riformer-s24_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s24_32xb128_in1k-384px_20230406-bafae7ab.pth) |
+| riformer-s36_384_in1k | 384x384 | 30.86 | 14.70 | 82.22 | 95.95 | [config](./riformer-s36_8xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s36_32xb128_in1k-384px_20230406-017ed3c4.pth) |
+| riformer-m36_384_in1k | 384x384 | 56.17 | 25.87 | 83.39 | 96.40 | [config](./riformer-m36_8xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m36_32xb128_in1k-384px_20230406-66a6f764.pth) |
+| riformer-m48_384_in1k | 384x384 | 73.47 | 34.06 | 83.70 | 96.60 | [config](./riformer-m48_8xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m48_32xb128_in1k-384px_20230406-2e874826.pth) |
+
+The config files of these models are only for inference.
+
+## Citation
+
+```bibtex
+@inproceedings{wang2023riformer,
+ title={RIFormer: Keep Your Vision Backbone Effective But Removing Token Mixer},
+ author={Wang, Jiahao and Zhang, Songyang and Liu, Yong and Wu, Taiqiang and Yang, Yujiu and Liu, Xihui and Chen, Kai and Luo, Ping and Lin, Dahua},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ year={2023}
+}
+```
diff --git a/configs/riformer/deploy/riformer-m36-deploy_8xb128_in1k.py b/configs/riformer/deploy/riformer-m36-deploy_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fcec41c810849d20c080faa1a710692e4b2bb9a0
--- /dev/null
+++ b/configs/riformer/deploy/riformer-m36-deploy_8xb128_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-m36_8xb128_in1k.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-m36-deploy_8xb64_in1k-384px.py b/configs/riformer/deploy/riformer-m36-deploy_8xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..e18f836f89d9057b1d8a1b6d31cd83d6bdca6b3a
--- /dev/null
+++ b/configs/riformer/deploy/riformer-m36-deploy_8xb64_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-m36_8xb64_in1k-384px.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k-384px.py b/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..0ab33534e271ccad60a9f6d896fa15238601a4e0
--- /dev/null
+++ b/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-m48_8xb64_in1k-384px.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k.py b/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e32ad328f893aaa0da1a4072315a91f514a594ce
--- /dev/null
+++ b/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-m48_8xb64_in1k.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k-384px.py b/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..ffbb4be31d76716432ff283d9d7c2d77370ddbb0
--- /dev/null
+++ b/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-s12_8xb128_in1k-384px.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py b/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..70fd8b74342e07ec2e3b4299364681ffbea5ec25
--- /dev/null
+++ b/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-s12_8xb128_in1k.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k-384px.py b/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..7d05e5c1a14afe10e05ae648e47c16d53220f226
--- /dev/null
+++ b/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-s24_8xb128_in1k-384px.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k.py b/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..47f83a08f4f2c6fa6ffc7105265b41c12e30fd2e
--- /dev/null
+++ b/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-s24_8xb128_in1k.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-s36-deploy_8xb128_in1k.py b/configs/riformer/deploy/riformer-s36-deploy_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c03bb15106829f22ba959d2a84d0a92ceba4dac
--- /dev/null
+++ b/configs/riformer/deploy/riformer-s36-deploy_8xb128_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-s36_8xb128_in1k.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-s36-deploy_8xb64_in1k-384px.py b/configs/riformer/deploy/riformer-s36-deploy_8xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..67b17ee5173e5bef7d2ecdf6d92e09cbb48db482
--- /dev/null
+++ b/configs/riformer/deploy/riformer-s36-deploy_8xb64_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-s36_8xb64_in1k-384px.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/metafile.yml b/configs/riformer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..5f3e2ec8773d26cde570bb874d2a45a73a49bc7b
--- /dev/null
+++ b/configs/riformer/metafile.yml
@@ -0,0 +1,152 @@
+Collections:
+ - Name: RIFormer
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Resources: 8x A100 GPUs
+ Architecture:
+ - Affine
+ - 1x1 Convolution
+ - LayerScale
+ Paper:
+ URL: https://arxiv.org/abs/2304.05659
+ Title: "RIFormer: Keep Your Vision Backbone Effective But Removing Token Mixer"
+ README: configs/riformer/README.md
+ Code:
+ Version: v1.0.0rc7
+ URL: null
+
+Models:
+ - Name: riformer-s12_in1k
+ Metadata:
+ FLOPs: 1822000000
+ Parameters: 11915000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.90
+ Top 5 Accuracy: 93.06
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k_20230406-6741ce71.pth
+ Config: configs/riformer/riformer-s12_8xb128_in1k.py
+ - Name: riformer-s24_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 3412000000
+ Parameters: 21389000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.28
+ Top 5 Accuracy: 94.80
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s24_32xb128_in1k_20230406-fdab072a.pth
+ Config: configs/riformer/riformer-s24_8xb128_in1k.py
+ - Name: riformer-s36_in1k
+ Metadata:
+ FLOPs: 5003000000
+ Parameters: 30863000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.29
+ Top 5 Accuracy: 95.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s36_32xb128_in1k_20230406-fdfcd3b0.pth
+ Config: configs/riformer/riformer-s36_8xb128_in1k.py
+ - Name: riformer-m36_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 8801000000
+ Parameters: 56173000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.57
+ Top 5 Accuracy: 95.99
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m36_32xb128_in1k_20230406-2fcb9d9b.pth
+ Config: configs/riformer/riformer-m36_8xb128_in1k.py
+ - Name: riformer-m48_in1k
+ Metadata:
+ FLOPs: 11590000000
+ Parameters: 73473000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.75
+ Top 5 Accuracy: 96.11
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m48_32xb128_in1k_20230406-2b9d1abf.pth
+ Config: configs/riformer/riformer-m48_8xb64_in1k.py
+ - Name: riformer-s12_in1k-384
+ Metadata:
+ FLOPs: 5355000000
+ Parameters: 11915000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.29
+ Top 5 Accuracy: 93.93
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k-384px_20230406-145eda4c.pth
+ Config: configs/riformer/riformer-s12_8xb128_in1k-384px.py
+ - Name: riformer-s24_in1k-384
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 10028000000
+ Parameters: 21389000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.36
+ Top 5 Accuracy: 95.40
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s24_32xb128_in1k-384px_20230406-bafae7ab.pth
+ Config: configs/riformer/riformer-s24_8xb128_in1k-384px.py
+ - Name: riformer-s36_in1k-384
+ Metadata:
+ FLOPs: 14702000000
+ Parameters: 30863000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.22
+ Top 5 Accuracy: 95.95
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s36_32xb128_in1k-384px_20230406-017ed3c4.pth
+ Config: configs/riformer/riformer-s36_8xb64_in1k-384px.py
+ - Name: riformer-m36_in1k-384
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 25865000000
+ Parameters: 56173000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.39
+ Top 5 Accuracy: 96.40
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m36_32xb128_in1k-384px_20230406-66a6f764.pth
+ Config: configs/riformer/riformer-m36_8xb64_in1k-384px.py
+ - Name: riformer-m48_in1k-384
+ Metadata:
+ FLOPs: 34060000000
+ Parameters: 73473000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.70
+ Top 5 Accuracy: 96.60
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m48_32xb128_in1k-384px_20230406-2e874826.pth
+ Config: configs/riformer/riformer-m48_8xb64_in1k-384px.py
diff --git a/configs/riformer/riformer-m36_8xb128_in1k.py b/configs/riformer/riformer-m36_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..30e93aa83d0f5c0b379367e2dc9b7a7d038108b4
--- /dev/null
+++ b/configs/riformer/riformer-m36_8xb128_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_poolformer_medium_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='m36',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-m36_8xb64_in1k-384px.py b/configs/riformer/riformer-m36_8xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..57f687cd50b60d99978dec7baeec4bf6a67e5de5
--- /dev/null
+++ b/configs/riformer/riformer-m36_8xb64_in1k-384px.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_riformer_medium_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='m36',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-m48_8xb64_in1k-384px.py b/configs/riformer/riformer-m48_8xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..ef6f1964624f76e204a5d257ddee2410f21ab456
--- /dev/null
+++ b/configs/riformer/riformer-m48_8xb64_in1k-384px.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_riformer_medium_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='m48',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-m48_8xb64_in1k.py b/configs/riformer/riformer-m48_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9dc5c3e291f136d40633e05c9c2931d140c532bc
--- /dev/null
+++ b/configs/riformer/riformer-m48_8xb64_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_poolformer_medium_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='m48',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-s12_8xb128_in1k-384px.py b/configs/riformer/riformer-s12_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..6d19dae07c811aeb0ca5af3cb92e57903405e49b
--- /dev/null
+++ b/configs/riformer/riformer-s12_8xb128_in1k-384px.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_riformer_small_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='s12',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-s12_8xb128_in1k.py b/configs/riformer/riformer-s12_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e85f8fb883de19f1021b8148fc680711149b5a9d
--- /dev/null
+++ b/configs/riformer/riformer-s12_8xb128_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='s12',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-s24_8xb128_in1k-384px.py b/configs/riformer/riformer-s24_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..6a1ec7b57385c4910ffaebcd152296bbdee360e1
--- /dev/null
+++ b/configs/riformer/riformer-s24_8xb128_in1k-384px.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_riformer_small_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='s24',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-s24_8xb128_in1k.py b/configs/riformer/riformer-s24_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..560cddcf8829703d2f1e9aaf4856e947b762b49a
--- /dev/null
+++ b/configs/riformer/riformer-s24_8xb128_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='s24',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-s36_8xb128_in1k.py b/configs/riformer/riformer-s36_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..28511307a294031301cb425d513844780d199606
--- /dev/null
+++ b/configs/riformer/riformer-s36_8xb128_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='s36',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-s36_8xb64_in1k-384px.py b/configs/riformer/riformer-s36_8xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..b3077357051632c81426e5d94322558412430373
--- /dev/null
+++ b/configs/riformer/riformer-s36_8xb64_in1k-384px.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_riformer_small_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='s36',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/sam/README.md b/configs/sam/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1a5668a3d0bff5aacac10f26a41714afe3622c78
--- /dev/null
+++ b/configs/sam/README.md
@@ -0,0 +1,57 @@
+# SAM
+
+> [Segment Anything](https://arxiv.org/abs/2304.02643)
+
+
+
+## Abstract
+
+We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive – often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vit-base-p16_sam-pre_3rdparty_sa1b-1024px', pretrained=True)
+inputs = torch.rand(1, 3, 1024, 1024)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------------------- | :--------: | :-------: | :-------------------------------------: | :----------------------------------------------------------------------------------------------: |
+| `vit-base-p16_sam-pre_3rdparty_sa1b-1024px`\* | 89.67 | 486.00 | [config](vit-base-p16_sam_headless.py) | [model](https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-base-p16_sam-pre_3rdparty_sa1b-1024px_20230411-2320f9cc.pth) |
+| `vit-large-p16_sam-pre_3rdparty_sa1b-1024px`\* | 308.00 | 1494.00 | [config](vit-large-p16_sam_headless.py) | [model](https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-large-p16_sam-pre_3rdparty_sa1b-1024px_20230411-595feafd.pth) |
+| `vit-huge-p16_sam-pre_3rdparty_sa1b-1024px`\* | 637.00 | 2982.00 | [config](vit-huge-p16_sam_headless.py) | [model](https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-huge-p16_sam-pre_3rdparty_sa1b-1024px_20230411-3f13c653.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/segment-anything/). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{kirillov2023segany,
+ title={Segment Anything},
+ author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross},
+ journal={arXiv:2304.02643},
+ year={2023}
+}
+```
diff --git a/configs/sam/metafile.yml b/configs/sam/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..1ac65ce7715e91468e108132493ecdcbb4db277c
--- /dev/null
+++ b/configs/sam/metafile.yml
@@ -0,0 +1,61 @@
+Collections:
+ - Name: SAM
+ Metadata:
+ Architecture:
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ Paper:
+ Title: 'Segment Anything'
+ URL: https://arxiv.org/abs/2304.02643
+ README: configs/sam/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: vit-base-p16_sam-pre_3rdparty_sa1b-1024px
+ Metadata:
+ FLOPs: 486000000000
+ Parameters: 89671000
+ Training Data:
+ - SA-1B
+ In Collection: SAM
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-base-p16_sam-pre_3rdparty_sa1b-1024px_20230411-2320f9cc.pth
+ Config: configs/sam/vit-base-p16_sam_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
+ Code: https://github.com/facebookresearch/segment-anything/
+
+ - Name: vit-large-p16_sam-pre_3rdparty_sa1b-1024px
+ Metadata:
+ FLOPs: 1494000000000
+ Parameters: 308000000
+ Training Data:
+ - SA-1B
+ In Collection: SAM
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-large-p16_sam-pre_3rdparty_sa1b-1024px_20230411-595feafd.pth
+ Config: configs/sam/vit-large-p16_sam_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth
+ Code: https://github.com/facebookresearch/segment-anything/
+
+ - Name: vit-huge-p16_sam-pre_3rdparty_sa1b-1024px
+ Metadata:
+ FLOPs: 2982000000000
+ Parameters: 637000000
+ Training Data:
+ - SA-1B
+ In Collection: SAM
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-huge-p16_sam-pre_3rdparty_sa1b-1024px_20230411-3f13c653.pth
+ Config: configs/sam/vit-huge-p16_sam_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
+ Code: https://github.com/facebookresearch/segment-anything/
diff --git a/configs/sam/vit-base-p16_sam_headless.py b/configs/sam/vit-base-p16_sam_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..bea26376ee932af5704fd5d232efc3cdf128e310
--- /dev/null
+++ b/configs/sam/vit-base-p16_sam_headless.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTSAM',
+ arch='base',
+ img_size=1024,
+ patch_size=16,
+ out_channels=256,
+ use_abs_pos=True,
+ use_rel_pos=True,
+ window_size=14,
+ ),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/sam/vit-huge-p16_sam_headless.py b/configs/sam/vit-huge-p16_sam_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..8004755bfbe7dd0e5366297f03f73494dc27c27b
--- /dev/null
+++ b/configs/sam/vit-huge-p16_sam_headless.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTSAM',
+ arch='huge',
+ img_size=1024,
+ patch_size=16,
+ out_channels=256,
+ use_abs_pos=True,
+ use_rel_pos=True,
+ window_size=14,
+ ),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/sam/vit-large-p16_sam_headless.py b/configs/sam/vit-large-p16_sam_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..1cebeb098205d081a4340fb4af369e2c29a20d66
--- /dev/null
+++ b/configs/sam/vit-large-p16_sam_headless.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTSAM',
+ arch='large',
+ img_size=1024,
+ patch_size=16,
+ out_channels=256,
+ use_abs_pos=True,
+ use_rel_pos=True,
+ window_size=14,
+ ),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/seresnet/README.md b/configs/seresnet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b5151ccde85112f12af2170796b169933e9a93ab
--- /dev/null
+++ b/configs/seresnet/README.md
@@ -0,0 +1,81 @@
+# SEResNet
+
+> [Squeeze-and-Excitation Networks](https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.html)
+
+
+
+## Abstract
+
+The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ~25%.
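+
+As a concrete picture of the SE block described above, here is a minimal PyTorch sketch (illustrative only; the SE layer used by the backbones in this repo may differ in details such as the gating layers and reduction-ratio handling):
+
+```python
+import torch
+import torch.nn as nn
+
+
+class SEBlock(nn.Module):
+    """Squeeze-and-Excitation: squeeze spatial info per channel, then
+    excite (re-weight) channels with a small gating MLP."""
+
+    def __init__(self, channels, reduction=16):
+        super().__init__()
+        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
+        self.fc = nn.Sequential(
+            nn.Linear(channels, channels // reduction),
+            nn.ReLU(inplace=True),
+            nn.Linear(channels // reduction, channels),
+            nn.Sigmoid(),  # per-channel gates in (0, 1)
+        )
+
+    def forward(self, x):
+        b, c, _, _ = x.shape
+        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
+        return x * w  # recalibrate channel responses
+
+
+x = torch.randn(2, 64, 56, 56)
+print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 56, 56])
+```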
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('seresnet50_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('seresnet50_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/seresnet/seresnet50_8xb32_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/seresnet/seresnet50_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet50_batch256_imagenet_20200804-ae206104.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :------------------------------------------------------------------------------------------: |
+| `seresnet50_8xb32_in1k` | From scratch | 28.09 | 4.13 | 77.74 | 93.84 | [config](seresnet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet50_batch256_imagenet_20200804-ae206104.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet50_batch256_imagenet_20200708-657b3c36.log.json) |
+| `seresnet101_8xb32_in1k` | From scratch | 49.33 | 7.86 | 78.26 | 94.07 | [config](seresnet101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet101_batch256_imagenet_20200804-ba5b51d4.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet101_batch256_imagenet_20200708-038a4d04.log.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{hu2018squeeze,
+ title={Squeeze-and-excitation networks},
+ author={Hu, Jie and Shen, Li and Sun, Gang},
+ booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
+ pages={7132--7141},
+ year={2018}
+}
+```
diff --git a/configs/seresnet/metafile.yml b/configs/seresnet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..1a9f116da4c8014e91e31af5db33d7b13b151826
--- /dev/null
+++ b/configs/seresnet/metafile.yml
@@ -0,0 +1,47 @@
+Collections:
+ - Name: SEResNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 140
+ Batch Size: 256
+ Architecture:
+ - ResNet
+ Paper:
+ URL: https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.html
+ Title: "Squeeze-and-Excitation Networks"
+ README: configs/seresnet/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/seresnet.py#L58
+ Version: v0.15.0
+
+Models:
+ - Name: seresnet50_8xb32_in1k
+ Metadata:
+ FLOPs: 4130000000
+ Parameters: 28090000
+ In Collection: SEResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.74
+ Top 5 Accuracy: 93.84
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet50_batch256_imagenet_20200804-ae206104.pth
+ Config: configs/seresnet/seresnet50_8xb32_in1k.py
+ - Name: seresnet101_8xb32_in1k
+ Metadata:
+ FLOPs: 7860000000
+ Parameters: 49330000
+ In Collection: SEResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.26
+ Top 5 Accuracy: 94.07
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet101_batch256_imagenet_20200804-ba5b51d4.pth
+ Config: configs/seresnet/seresnet101_8xb32_in1k.py
diff --git a/configs/seresnet/seresnet101_8xb32_in1k.py b/configs/seresnet/seresnet101_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8be39e7a32aa38a5c7d0b355c39a28ddff087cf1
--- /dev/null
+++ b/configs/seresnet/seresnet101_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/seresnet101.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/seresnet/seresnet50_8xb32_in1k.py b/configs/seresnet/seresnet50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..19082bd0dd6bde367a064900f5c51d730bea2923
--- /dev/null
+++ b/configs/seresnet/seresnet50_8xb32_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/seresnet50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_140e.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/seresnet/seresnext101-32x4d_8xb32_in1k.py b/configs/seresnet/seresnext101-32x4d_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..01778305caf8196e73a77f39783ead80a0c3ea56
--- /dev/null
+++ b/configs/seresnet/seresnext101-32x4d_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/seresnext101_32x4d.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/seresnet/seresnext50-32x4d_8xb32_in1k.py b/configs/seresnet/seresnext50-32x4d_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4d593e45b8992254f97de77fa4d157e9c31ce352
--- /dev/null
+++ b/configs/seresnet/seresnext50-32x4d_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/seresnext50_32x4d.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/shufflenet_v1/README.md b/configs/shufflenet_v1/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..618a22d775eae984809e4881207c0f645fc1d8c9
--- /dev/null
+++ b/configs/shufflenet_v1/README.md
@@ -0,0 +1,80 @@
+# ShuffleNet V1
+
+> [ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices](https://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html)
+
+
+
+## Abstract
+
+We introduce an extremely computation-efficient CNN architecture named ShuffleNet, which is designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. lower top-1 error (absolute 7.8%) than recent MobileNet on ImageNet classification task, under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves ~13x actual speedup over AlexNet while maintaining comparable accuracy.
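+
+The channel shuffle operation mentioned above is just a grouped reshape and transpose that interleaves channels across groups after a pointwise group convolution. A small illustrative sketch follows (assumed standalone helper, not the operator implementation shipped in this repo):
+
+```python
+import torch
+
+
+def channel_shuffle(x, groups):
+    """Interleave channels across groups so information can flow between
+    the groups produced by a pointwise group convolution."""
+    b, c, h, w = x.shape
+    assert c % groups == 0
+    # (b, groups, c // groups, h, w) -> swap group/channel dims -> flatten back
+    x = x.view(b, groups, c // groups, h, w)
+    x = x.transpose(1, 2).contiguous()
+    return x.view(b, c, h, w)
+
+
+x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
+print(channel_shuffle(x, groups=2).flatten().tolist())
+# [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0] -- channels interleaved across the 2 groups
+```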
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('shufflenet-v1-1x_16xb64_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('shufflenet-v1-1x_16xb64_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :------------------------------------------------------------------------------: |
+| `shufflenet-v1-1x_16xb64_in1k` | From scratch | 1.87 | 0.15 | 68.13 | 87.81 | [config](shufflenet-v1-1x_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{zhang2018shufflenet,
+ title={Shufflenet: An extremely efficient convolutional neural network for mobile devices},
+ author={Zhang, Xiangyu and Zhou, Xinyu and Lin, Mengxiao and Sun, Jian},
+ booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
+ pages={6848--6856},
+ year={2018}
+}
+```
diff --git a/configs/shufflenet_v1/metafile.yml b/configs/shufflenet_v1/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..e3ca1393e629153f81791c4f584ec0ded04839e2
--- /dev/null
+++ b/configs/shufflenet_v1/metafile.yml
@@ -0,0 +1,35 @@
+Collections:
+ - Name: Shufflenet V1
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ - No BN decay
+ Training Resources: 8x 1080 GPUs
+ Epochs: 300
+ Batch Size: 1024
+ Architecture:
+ - Shufflenet V1
+ Paper:
+ URL: https://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html
+ Title: "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices"
+ README: configs/shufflenet_v1/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/shufflenet_v1.py#L152
+ Version: v0.15.0
+
+Models:
+ - Name: shufflenet-v1-1x_16xb64_in1k
+ Metadata:
+ FLOPs: 146000000
+ Parameters: 1870000
+ In Collection: Shufflenet V1
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 68.13
+ Top 5 Accuracy: 87.81
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.pth
+ Config: configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py
diff --git a/configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py b/configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..58e45f1ba419f285d750d4487e40a3dbc803d8e1
--- /dev/null
+++ b/configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/shufflenet_v1_1x.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/shufflenet_v2/README.md b/configs/shufflenet_v2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..804aac18087ad8d1cf49c4b7c10ab36eb8128ade
--- /dev/null
+++ b/configs/shufflenet_v2/README.md
@@ -0,0 +1,80 @@
+# Shufflenet V2
+
+> [ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design](https://openaccess.thecvf.com/content_ECCV_2018/papers/Ningning_Light-weight_CNN_Architecture_ECCV_2018_paper.pdf)
+
+
+
+## Abstract
+
+Currently, the neural network architecture design is mostly guided by the *indirect* metric of computation complexity, i.e., FLOPs. However, the *direct* metric, e.g., speed, also depends on other factors such as memory access cost and platform characteristics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical *guidelines* for efficient network design. Accordingly, a new architecture is presented, called *ShuffleNet V2*. Comprehensive ablation experiments verify that our model is the state-of-the-art in terms of speed and accuracy tradeoff.
+
+
+

+
+
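+Since the paper argues for judging efficiency by the direct metric (speed on the target platform) rather than FLOPs alone, it can be instructive to time the model yourself. The rough sketch below uses the `get_model` API demonstrated in the next section; the warm-up and iteration counts are arbitrary and CPU timings are only indicative.
+
+```python
+import time
+
+import torch
+from mmpretrain import get_model
+
+model = get_model('shufflenet-v2-1x_16xb64_in1k', pretrained=False).eval()
+inputs = torch.rand(1, 3, 224, 224)
+
+with torch.no_grad():
+    # warm up, then measure the average forward latency
+    for _ in range(5):
+        model(inputs)
+    start = time.perf_counter()
+    for _ in range(20):
+        model(inputs)
+print(f'average latency: {(time.perf_counter() - start) / 20 * 1000:.1f} ms')
+```
+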
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('shufflenet-v2-1x_16xb64_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('shufflenet-v2-1x_16xb64_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :------------------------------------------------------------------------------: |
+| `shufflenet-v2-1x_16xb64_in1k` | From scratch | 2.28 | 0.15 | 69.55 | 88.92 | [config](shufflenet-v2-1x_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{ma2018shufflenet,
+ title={Shufflenet v2: Practical guidelines for efficient cnn architecture design},
+ author={Ma, Ningning and Zhang, Xiangyu and Zheng, Hai-Tao and Sun, Jian},
+ booktitle={Proceedings of the European conference on computer vision (ECCV)},
+ pages={116--131},
+ year={2018}
+}
+```
diff --git a/configs/shufflenet_v2/metafile.yml b/configs/shufflenet_v2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..9c1eebc5e9fdb66523f719bdae1bdd38a58fea84
--- /dev/null
+++ b/configs/shufflenet_v2/metafile.yml
@@ -0,0 +1,35 @@
+Collections:
+ - Name: Shufflenet V2
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ - No BN decay
+ Training Resources: 8x 1080 GPUs
+ Epochs: 300
+ Batch Size: 1024
+ Architecture:
+ - Shufflenet V2
+ Paper:
+ URL: https://openaccess.thecvf.com/content_ECCV_2018/papers/Ningning_Light-weight_CNN_Architecture_ECCV_2018_paper.pdf
+ Title: "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design"
+ README: configs/shufflenet_v2/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/shufflenet_v2.py#L134
+ Version: v0.15.0
+
+Models:
+ - Name: shufflenet-v2-1x_16xb64_in1k
+ Metadata:
+ FLOPs: 149000000
+ Parameters: 2280000
+ In Collection: Shufflenet V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.55
+ Top 5 Accuracy: 88.92
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.pth
+ Config: configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py
diff --git a/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py b/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a106ab8686c985a66b1c9b6af3407ef48a40c64e
--- /dev/null
+++ b/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/shufflenet_v2_1x.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/simclr/README.md b/configs/simclr/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..17d0de2b79499ec47cdcb4e5eff59d362b77fced
--- /dev/null
+++ b/configs/simclr/README.md
@@ -0,0 +1,87 @@
+# SimCLR
+
+> [A simple framework for contrastive learning of visual representations](https://arxiv.org/abs/2002.05709)
+
+
+
+## Abstract
+
+This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50.
+
+
+

+
+
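+The objective behind SimCLR is the NT-Xent (normalized temperature-scaled cross entropy) loss computed between the projected embeddings of two augmented views of each image. The snippet below is a compact reference sketch of that loss in plain PyTorch, separate from the `ContrastiveHead` used by the configs in this folder; the temperature of 0.1 matches the configs.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
+    """NT-Xent loss for two batches of projections (N, D) from two views."""
+    n = z1.size(0)
+    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D)
+    sim = z @ z.t() / temperature                       # (2N, 2N)
+    sim.fill_diagonal_(float('-inf'))                   # drop self-similarity
+    # the positive for sample i is its other view, located N rows away
+    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
+    return F.cross_entropy(sim, targets)
+
+
+print(nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128)).item())
+```
+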
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_simclr-200e-pre_8xb512-linear-coslr-90e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('simclr_resnet50_16xb256-coslr-200e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f12c0457.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :---------------------------------------- | :--------: | :-------: | :--------------------------------------------------: | :--------------------------------------------------------------------------------------: |
+| `simclr_resnet50_16xb256-coslr-200e_in1k` | 27.97 | 4.11 | [config](simclr_resnet50_16xb256-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/simclr_resnet50_16xb256-coslr-200e_in1k_20220825-4d9cce50.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/simclr_resnet50_16xb256-coslr-200e_in1k_20220825-4d9cce50.json) |
+| `simclr_resnet50_16xb256-coslr-800e_in1k` | 27.97 | 4.11 | [config](simclr_resnet50_16xb256-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/simclr_resnet50_16xb256-coslr-800e_in1k_20220825-85fcc4de.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/simclr_resnet50_16xb256-coslr-800e_in1k_20220825-85fcc4de.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_simclr-200e-pre_8xb512-linear-coslr-90e_in1k` | [SIMCLR 200-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/simclr_resnet50_16xb256-coslr-200e_in1k_20220825-4d9cce50.pth) | 25.56 | 4.11 | 66.90 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f12c0457.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f12c0457.json) |
+| `resnet50_simclr-800e-pre_8xb512-linear-coslr-90e_in1k` | [SIMCLR 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/simclr_resnet50_16xb256-coslr-800e_in1k_20220825-85fcc4de.pth) | 25.56 | 4.11 | 69.20 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-b80ae1e5.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-b80ae1e5.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{chen2020simple,
+ title={A simple framework for contrastive learning of visual representations},
+ author={Chen, Ting and Kornblith, Simon and Norouzi, Mohammad and Hinton, Geoffrey},
+ booktitle={ICML},
+ year={2020},
+}
+```
diff --git a/configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py b/configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2b5074c082b8b6fb36bd3c6711b60bab6394b4ce
--- /dev/null
+++ b/configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
@@ -0,0 +1,18 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_lars_coslr_90e.py',
+ '../../_base_/default_runtime.py',
+]
+
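+# Linear evaluation protocol: the ResNet stages are frozen and initialized from a
+# self-supervised checkpoint (set the `checkpoint` path before training), so only
+# the classification head is optimized.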
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# dataset summary
+train_dataloader = dict(batch_size=512)
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/simclr/metafile.yml b/configs/simclr/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..23c31ed3533160739f66731b9c02f6547910dd44
--- /dev/null
+++ b/configs/simclr/metafile.yml
@@ -0,0 +1,72 @@
+Collections:
+ - Name: SimCLR
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - LARS
+ Training Resources: 8x V100 GPUs (b256), 16x A100-80G GPUs (b4096)
+ Architecture:
+ - ResNet
+ - SimCLR
+ Paper:
+ Title: A simple framework for contrastive learning of visual representations
+ URL: https://arxiv.org/abs/2002.05709
+ README: configs/simclr/README.md
+
+Models:
+ - Name: simclr_resnet50_16xb256-coslr-200e_in1k
+ Metadata:
+ Epochs: 200
+ Batch Size: 4096
+ FLOPs: 4109364224
+ Parameters: 27968832
+ Training Data: ImageNet-1k
+ In Collection: SimCLR
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/simclr_resnet50_16xb256-coslr-200e_in1k_20220825-4d9cce50.pth
+ Config: configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py
+ Downstream:
+ - resnet50_simclr-200e-pre_8xb512-linear-coslr-90e_in1k
+ - Name: simclr_resnet50_16xb256-coslr-800e_in1k
+ Metadata:
+ Epochs: 800
+ Batch Size: 4096
+ FLOPs: 4109364224
+ Parameters: 27968832
+ Training Data: ImageNet-1k
+ In Collection: SimCLR
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/simclr_resnet50_16xb256-coslr-800e_in1k_20220825-85fcc4de.pth
+ Config: configs/simclr/simclr_resnet50_16xb256-coslr-800e_in1k.py
+ Downstream:
+ - resnet50_simclr-800e-pre_8xb512-linear-coslr-90e_in1k
+ - Name: resnet50_simclr-200e-pre_8xb512-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 4096
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: SimCLR
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 66.9
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f12c0457.pth
+ Config: configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
+ - Name: resnet50_simclr-800e-pre_8xb512-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 4096
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: SimCLR
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.2
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-b80ae1e5.pth
+ Config: configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
diff --git a/configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py b/configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b48d5b31071dbb5622616b62835caa6cdd8d9589
--- /dev/null
+++ b/configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py
@@ -0,0 +1,46 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_simclr.py',
+ '../_base_/schedules/imagenet_lars_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='SimCLR',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck', # SimCLR non-linear neck
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=128,
+ num_layers=2,
+ with_avg_pool=True),
+ head=dict(
+ type='ContrastiveHead',
+ loss=dict(type='CrossEntropyLoss'),
+ temperature=0.1),
+)
+
+# optimizer
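+# lr=4.8 is the base lr of 0.3 scaled linearly to the total batch size
+# (0.3 * 4096 / 256); BN and bias parameters are excluded from LARS adaptation.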
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=4.8, momentum=0.9, weight_decay=1e-6),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }))
+
+# runtime settings
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/simclr/simclr_resnet50_16xb256-coslr-800e_in1k.py b/configs/simclr/simclr_resnet50_16xb256-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..478ef0c33418a9467d01c2a0c133be119318359c
--- /dev/null
+++ b/configs/simclr/simclr_resnet50_16xb256-coslr-800e_in1k.py
@@ -0,0 +1,57 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_simclr.py',
+ '../_base_/schedules/imagenet_lars_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='SimCLR',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck', # SimCLR non-linear neck
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=128,
+ num_layers=2,
+ with_avg_pool=True),
+ head=dict(
+ type='ContrastiveHead',
+ loss=dict(type='CrossEntropyLoss'),
+ temperature=0.1),
+)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=4.8, momentum=0.9, weight_decay=1e-6),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }))
+
+# learning rate scheduler
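+# 10 epochs of linear warmup, then cosine decay over the remaining 790 epochs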
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR', T_max=790, by_epoch=True, begin=10, end=800)
+]
+
+# runtime settings
+train_cfg = dict(max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/simclr/simclr_resnet50_8xb32-coslr-200e_in1k.py b/configs/simclr/simclr_resnet50_8xb32-coslr-200e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..36a144536e832c5e022675f3f6878d1cfa71c563
--- /dev/null
+++ b/configs/simclr/simclr_resnet50_8xb32-coslr-200e_in1k.py
@@ -0,0 +1,47 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_simclr.py',
+ '../_base_/schedules/imagenet_lars_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='SimCLR',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck', # SimCLR non-linear neck
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=128,
+ num_layers=2,
+ with_avg_pool=True),
+ head=dict(
+ type='ContrastiveHead',
+ loss=dict(type='CrossEntropyLoss'),
+ temperature=0.1),
+)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=0.3, momentum=0.9, weight_decay=1e-6),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }))
+
+# runtime settings
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/simmim/README.md b/configs/simmim/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..3e44b0790086ac62c5719eba3198fd531f2dab98
--- /dev/null
+++ b/configs/simmim/README.md
@@ -0,0 +1,90 @@
+# SimMIM
+
+> [SimMIM: A Simple Framework for Masked Image Modeling](https://arxiv.org/abs/2111.09886)
+
+
+
+## Abstract
+
+This paper presents SimMIM, a simple framework for masked image modeling. We simplify recently proposed related approaches without special designs such as blockwise masking and tokenization via discrete VAE or clustering. To study what lets the masked image modeling task learn good representations, we systematically study the major components in our framework, and find that simple designs of each component have revealed very strong representation learning performance: 1) random masking of the input image with a moderately large masked patch size (e.g., 32) makes a strong pre-text task; 2) predicting raw pixels of RGB values by direct regression performs no worse than the patch classification approaches with complex designs; 3) the prediction head can be as light as a linear layer, with no worse performance than heavier ones. Using ViT-B, our approach achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K by pre-training also on this dataset, surpassing the previous best approach by +0.6%. When applied to a larger model of about 650 million parameters, SwinV2-H, it achieves 87.1% top-1 accuracy on ImageNet-1K using only ImageNet-1K data. We also leverage this approach to facilitate the training of a 3B model (SwinV2-G) that, using 40× less data than in previous practice, achieves the state-of-the-art on four representative vision benchmarks. The code and models will be publicly available at https://github.com/microsoft/SimMIM.
+
+
+

+
+
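+The pretext task is deliberately simple: mask random patches of the input and regress the raw RGB pixels of the masked region with an L1 loss. The sketch below replays that recipe on raw tensors; the patch size (32) and mask ratio (0.6) follow the paper, while the prediction is faked with random values since the real one comes from the Swin encoder and the linear decoder.
+
+```python
+import torch
+import torch.nn.functional as F
+
+patch_size, mask_ratio = 32, 0.6
+img = torch.rand(1, 3, 192, 192)
+
+# random patch-level mask (1 = masked), upsampled to pixel resolution
+side = 192 // patch_size
+mask = (torch.rand(1, 1, side, side) < mask_ratio).float()
+mask = F.interpolate(mask, scale_factor=patch_size, mode='nearest')
+
+# a real model predicts the pixels of the masked patches; fake it here
+pred = torch.rand_like(img)
+
+# L1 reconstruction loss averaged over the masked pixels only
+loss = (F.l1_loss(pred, img, reduction='none') * mask).sum() / (mask.sum() * 3 + 1e-8)
+print(loss.item())
+```
+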
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('swin-base-w6_simmim-100e-pre_8xb256-coslr-100e_in1k-192px', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px', pretrained=True)
+inputs = torch.rand(1, 3, 192, 192)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k/swin-base_ft-8xb256-coslr-100e_in1k_20220829-9cf23aa1.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :-------------------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------------: | :-------------------------------------------------------------: |
+| `simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px` | 89.87 | 18.83 | [config](simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.json) |
+| `simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px` | 89.87 | 18.83 | [config](simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192_20220916-a0e931ac.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192_20220916-a0e931ac.json) |
+| `simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px` | 199.92 | 55.85 | [config](simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192_20220916-4ad216d3.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192_20220916-4ad216d3.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `swin-base-w6_simmim-100e-pre_8xb256-coslr-100e_in1k-192px` | [SIMMIM 100-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.pth) | 87.75 | 11.30 | 82.70 | [config](benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k/swin-base_ft-8xb256-coslr-100e_in1k_20220829-9cf23aa1.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k/swin-base_ft-8xb256-coslr-100e_in1k_20220829-9cf23aa1.json) |
+| `swin-base-w7_simmim-100e-pre_8xb256-coslr-100e_in1k` | [SIMMIM 100-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.pth) | 87.77 | 15.47 | 83.50 | [config](benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py) | N/A |
+| `swin-base-w6_simmim-800e-pre_8xb256-coslr-100e_in1k-192px` | [SIMMIM 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192_20220916-a0e931ac.pth) | 87.77 | 15.47 | 83.80 | [config](benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k-224/swin-base_ft-8xb256-coslr-100e_in1k-224_20221208-155cc6e6.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k-224/swin-base_ft-8xb256-coslr-100e_in1k-224_20221208-155cc6e6.json) |
+| `swin-large-w14_simmim-800e-pre_8xb256-coslr-100e_in1k` | [SIMMIM 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192_20220916-4ad216d3.pth) | 196.85 | 38.85 | 84.80 | [config](benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224_20220916-d4865790.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224_20220916-d4865790.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{xie2021simmim,
+ title={SimMIM: A Simple Framework for Masked Image Modeling},
+ author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Bao, Jianmin and Yao, Zhuliang and Dai, Qi and Hu, Han},
+ booktitle={International Conference on Computer Vision and Pattern Recognition (CVPR)},
+ year={2022}
+}
+```
diff --git a/configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py b/configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..47c4fa1ccfa42b0d6a3c7eb58f43f8250441b7f3
--- /dev/null
+++ b/configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py
@@ -0,0 +1,59 @@
+_base_ = [
+ '../../_base_/models/swin_transformer/base_224.py',
+ '../../_base_/datasets/imagenet_bs256_swin_192.py',
+ '../../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ img_size=192,
+ drop_path_rate=0.1,
+ stage_cfgs=dict(block_cfgs=dict(window_size=6)),
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# optimizer settings
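+# layer-wise lr decay: the lr is multiplied by 0.9 per layer going from the
+# classification head towards the patch embedding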
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=dict(type='AdamW', lr=5e-3, weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.9,
+ custom_keys={
+ '.norm': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.absolute_pos_embed': dict(decay_mult=0.0),
+ '.relative_position_bias_table': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=2.5e-7 / 1.25e-3,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=80,
+ eta_min=2.5e-7 * 2048 / 512,
+ by_epoch=True,
+ begin=20,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=100))
+
+randomness = dict(seed=0)
diff --git a/configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py b/configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f7325f03d6b495b9b775f4e2cc3c33a06f6af7dd
--- /dev/null
+++ b/configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py
@@ -0,0 +1,102 @@
+_base_ = [
+ '../../_base_/models/swin_transformer/base_224.py',
+ '../../_base_/datasets/imagenet_bs256_swin_192.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ backbone=dict(
+ img_size=224,
+ drop_path_rate=0.1,
+ stage_cfgs=dict(block_cfgs=dict(window_size=7)),
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# optimizer settings
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=dict(type='AdamW', lr=5e-3, weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.9,
+ custom_keys={
+ '.norm': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.absolute_pos_embed': dict(decay_mult=0.0),
+ '.relative_position_bias_table': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=2.5e-7 / 1.25e-3,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=80,
+ eta_min=2.5e-7 * 2048 / 512,
+ by_epoch=True,
+ begin=20,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=100))
+
+randomness = dict(seed=0)
diff --git a/configs/simmim/benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py b/configs/simmim/benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a6eafd84d3c3f3224567747bcf645114286394f0
--- /dev/null
+++ b/configs/simmim/benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py
@@ -0,0 +1,105 @@
+_base_ = [
+ '../../_base_/models/swin_transformer/base_224.py',
+ '../../_base_/datasets/imagenet_bs256_swin_192.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ backbone=dict(
+ arch='large',
+ img_size=224,
+ drop_path_rate=0.2,
+ stage_cfgs=dict(block_cfgs=dict(window_size=14)),
+ pad_small_map=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ head=dict(in_channels=1536))
+
+# optimizer settings
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=dict(type='AdamW', lr=5e-3, weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.7,
+ custom_keys={
+ '.norm': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.absolute_pos_embed': dict(decay_mult=0.0),
+ '.relative_position_bias_table': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=2.5e-7 / 1.25e-3,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=100,
+ eta_min=1e-6,
+ by_epoch=True,
+ begin=20,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=100))
+
+randomness = dict(seed=0)
diff --git a/configs/simmim/metafile.yml b/configs/simmim/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..19d9446c45c5f86315cc61be206430ea7bd97643
--- /dev/null
+++ b/configs/simmim/metafile.yml
@@ -0,0 +1,115 @@
+Collections:
+ - Name: SimMIM
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ Training Resources: 16x A100 GPUs
+ Architecture:
+ - Swin
+ Paper:
+ Title: 'SimMIM: A Simple Framework for Masked Image Modeling'
+ URL: https://arxiv.org/abs/2111.09886
+ README: configs/simmim/README.md
+
+Models:
+ - Name: simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px
+ Metadata:
+ Epochs: 100
+ Batch Size: 2048
+ FLOPs: 18832161792
+ Parameters: 89874104
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.pth
+ Config: configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py
+ Downstream:
+ - swin-base-w6_simmim-100e-pre_8xb256-coslr-100e_in1k-192px
+ - swin-base-w7_simmim-100e-pre_8xb256-coslr-100e_in1k
+ - Name: simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px
+ Metadata:
+ Epochs: 800
+ Batch Size: 2048
+ FLOPs: 18832161792
+ Parameters: 89874104
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192_20220916-a0e931ac.pth
+ Config: configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py
+ Downstream:
+ - swin-base-w6_simmim-800e-pre_8xb256-coslr-100e_in1k-192px
+ - Name: simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px
+ Metadata:
+ Epochs: 800
+ Batch Size: 2048
+ FLOPs: 55849130496
+ Parameters: 199920372
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192_20220916-4ad216d3.pth
+ Config: configs/simmim/simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py
+ Downstream:
+ - swin-large-w14_simmim-800e-pre_8xb256-coslr-100e_in1k
+ - Name: swin-base-w6_simmim-100e-pre_8xb256-coslr-100e_in1k-192px
+ Metadata:
+ Epochs: 100
+ Batch Size: 2048
+ FLOPs: 11303976960
+ Parameters: 87750176
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.7
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k/swin-base_ft-8xb256-coslr-100e_in1k_20220829-9cf23aa1.pth
+ Config: configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py
+ - Name: swin-base-w7_simmim-100e-pre_8xb256-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 2048
+ FLOPs: 15466852352
+ Parameters: 87768224
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.5
+ Weights: null
+ Config: configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py
+ - Name: swin-base-w6_simmim-800e-pre_8xb256-coslr-100e_in1k-192px
+ Metadata:
+ Epochs: 100
+ Batch Size: 2048
+ FLOPs: 15466852352
+ Parameters: 87768224
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.8
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k-224/swin-base_ft-8xb256-coslr-100e_in1k-224_20221208-155cc6e6.pth
+ Config: configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py
+ - Name: swin-large-w14_simmim-800e-pre_8xb256-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 2048
+ FLOPs: 38853083136
+ Parameters: 196848316
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.8
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224_20220916-d4865790.pth
+ Config: configs/simmim/benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py
diff --git a/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-100e_in1k-192px.py b/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-100e_in1k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..ed9dfdb85d6ebb0e87f18257a9320bc9166f4c5e
--- /dev/null
+++ b/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-100e_in1k-192px.py
@@ -0,0 +1,4 @@
+_base_ = 'simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py'
+
+# dataset settings: 16 GPUs x 128 samples per GPU (total batch size 2048, as in the 8xb256 config)
+train_dataloader = dict(batch_size=128)
diff --git a/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py b/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..560714b7d6a74a22f6d8bb4358a0977fc73909e8
--- /dev/null
+++ b/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py
@@ -0,0 +1,64 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_simmim_192.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='SimMIM',
+ backbone=dict(
+ type='SimMIMSwinTransformer',
+ arch='base',
+ img_size=192,
+ stage_cfgs=dict(block_cfgs=dict(window_size=6))),
+ neck=dict(
+ type='SimMIMLinearDecoder', in_channels=128 * 2**3, encoder_stride=32),
+ head=dict(
+ type='SimMIMHead',
+ patch_size=4,
+ loss=dict(type='PixelReconstructionLoss', criterion='L1', channel=3)))
+
+# optimizer wrapper
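+# base lr 1e-4 is scaled linearly to the effective batch size (16 GPUs x 128 = 2048)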
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=1e-4 * 2048 / 512,
+ betas=(0.9, 0.999),
+ weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'relative_position_bias_table': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
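+# 10-epoch linear warmup, then MultiStepLR drops the lr once at epoch 700
+# (with the default gamma of 0.1)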
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=5e-7 / 1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='MultiStepLR',
+ milestones=[700],
+ by_epoch=True,
+ begin=10,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py b/configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..a0be14486a3e29b14b78e507108f57d803404b8f
--- /dev/null
+++ b/configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_simmim_192.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='SimMIM',
+ backbone=dict(
+ type='SimMIMSwinTransformer',
+ arch='base',
+ img_size=192,
+ stage_cfgs=dict(block_cfgs=dict(window_size=6))),
+ neck=dict(
+ type='SimMIMLinearDecoder', in_channels=128 * 2**3, encoder_stride=32),
+ head=dict(
+ type='SimMIMHead',
+ patch_size=4,
+ loss=dict(type='PixelReconstructionLoss', criterion='L1', channel=3)))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=2e-4 * 2048 / 512,
+ betas=(0.9, 0.999),
+ weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'relative_position_bias_table': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-6 / 2e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=90,
+ eta_min=1e-5 * 2048 / 512,
+ by_epoch=True,
+ begin=10,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+# runtime
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/simmim/simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py b/configs/simmim/simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..0563023bd796e640c5c4caff2b9dc9bc555227c4
--- /dev/null
+++ b/configs/simmim/simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_simmim_192.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='SimMIM',
+ backbone=dict(
+ type='SimMIMSwinTransformer',
+ arch='large',
+ img_size=192,
+ stage_cfgs=dict(block_cfgs=dict(window_size=12)),
+ pad_small_map=True),
+ neck=dict(
+ type='SimMIMLinearDecoder', in_channels=192 * 2**3, encoder_stride=32),
+ head=dict(
+ type='SimMIMHead',
+ patch_size=4,
+ loss=dict(type='PixelReconstructionLoss', criterion='L1', channel=3)))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=1e-4 * 2048 / 512,
+ betas=(0.9, 0.999),
+ weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'relative_position_bias_table': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=5e-7 / 1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='MultiStepLR',
+ milestones=[700],
+ by_epoch=True,
+ begin=10,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/simsiam/README.md b/configs/simsiam/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..117e45bf7bec09a86558d3372663440d5859155f
--- /dev/null
+++ b/configs/simsiam/README.md
@@ -0,0 +1,87 @@
+# SimSiam
+
+> [Exploring simple siamese representation learning](https://arxiv.org/abs/2011.10566)
+
+
+
+## Abstract
+
+Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our “SimSiam” method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning.
+
+
+

+
+
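+The decisive ingredient discussed above is the stop-gradient: each branch's prediction `p` is matched to the detached projection `z` of the other branch with a negative cosine similarity, and the two directions are averaged. Below is a minimal sketch of that symmetric loss, independent of the `LatentPredictHead` used by the configs in this folder.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def simsiam_loss(p1, p2, z1, z2):
+    """Symmetric negative cosine similarity with stop-gradient on z."""
+
+    def d(p, z):
+        # stop-gradient: the projection targets are treated as constants
+        return -F.cosine_similarity(p, z.detach(), dim=1).mean()
+
+    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)
+
+
+p1, p2, z1, z2 = (torch.randn(4, 2048) for _ in range(4))
+print(simsiam_loss(p1, p2, z1, z2).item())
+```
+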
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_simsiam-100e-pre_8xb512-linear-coslr-90e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('simsiam_resnet50_8xb32-coslr-100e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f53ba400.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------------- | :--------: | :-------: | :-------------------------------------------------: | :----------------------------------------------------------------------------------------: |
+| `simsiam_resnet50_8xb32-coslr-100e_in1k` | 38.20 | 4.11 | [config](simsiam_resnet50_8xb32-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/simsiam_resnet50_8xb32-coslr-100e_in1k_20220825-d07cb2e6.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/simsiam_resnet50_8xb32-coslr-100e_in1k_20220825-d07cb2e6.json) |
+| `simsiam_resnet50_8xb32-coslr-200e_in1k` | 38.20 | 4.11 | [config](simsiam_resnet50_8xb32-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/simsiam_resnet50_8xb32-coslr-200e_in1k_20220825-efe91299.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/simsiam_resnet50_8xb32-coslr-200e_in1k_20220825-efe91299.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_simsiam-100e-pre_8xb512-linear-coslr-90e_in1k` | [SIMSIAM 100-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/simsiam_resnet50_8xb32-coslr-100e_in1k_20220825-d07cb2e6.pth) | 25.56 | 4.11 | 68.30 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f53ba400.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f53ba400.json) |
+| `resnet50_simsiam-200e-pre_8xb512-linear-coslr-90e_in1k` | [SIMSIAM 200-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/simsiam_resnet50_8xb32-coslr-200e_in1k_20220825-efe91299.pth) | 25.56 | 4.11 | 69.80 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-519b5135.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-519b5135.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{chen2021exploring,
+ title={Exploring simple siamese representation learning},
+ author={Chen, Xinlei and He, Kaiming},
+ booktitle={CVPR},
+ year={2021}
+}
+```
diff --git a/configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py b/configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2b5074c082b8b6fb36bd3c6711b60bab6394b4ce
--- /dev/null
+++ b/configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
@@ -0,0 +1,18 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_lars_coslr_90e.py',
+ '../../_base_/default_runtime.py',
+]
+
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# dataset summary
+train_dataloader = dict(batch_size=512)
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/simsiam/metafile.yml b/configs/simsiam/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..40f6706511cf6cf49f8b65153ffd575348abeeca
--- /dev/null
+++ b/configs/simsiam/metafile.yml
@@ -0,0 +1,72 @@
+Collections:
+ - Name: SimSiam
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Architecture:
+ - ResNet
+ Paper:
+ Title: Exploring simple siamese representation learning
+ URL: https://arxiv.org/abs/2011.10566
+ README: configs/simsiam/README.md
+
+Models:
+ - Name: simsiam_resnet50_8xb32-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 256
+ FLOPs: 4109364224
+ Parameters: 38199360
+ Training Data: ImageNet-1k
+ In Collection: SimSiam
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/simsiam_resnet50_8xb32-coslr-100e_in1k_20220825-d07cb2e6.pth
+ Config: configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py
+ Downstream:
+ - resnet50_simsiam-100e-pre_8xb512-linear-coslr-90e_in1k
+ - Name: simsiam_resnet50_8xb32-coslr-200e_in1k
+ Metadata:
+ Epochs: 200
+ Batch Size: 256
+ FLOPs: 4109364224
+ Parameters: 38199360
+ Training Data: ImageNet-1k
+ In Collection: SimSiam
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/simsiam_resnet50_8xb32-coslr-200e_in1k_20220825-efe91299.pth
+ Config: configs/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k.py
+ Downstream:
+ - resnet50_simsiam-200e-pre_8xb512-linear-coslr-90e_in1k
+ - Name: resnet50_simsiam-100e-pre_8xb512-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 4096
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: SimSiam
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 68.3
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f53ba400.pth
+ Config: configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
+ - Name: resnet50_simsiam-200e-pre_8xb512-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 4096
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: SimSiam
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.8
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-519b5135.pth
+ Config: configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
diff --git a/configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py b/configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ad19af6acaa530f0a0c3120034fa836cec965642
--- /dev/null
+++ b/configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py
@@ -0,0 +1,58 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_mocov2.py',
+ '../_base_/schedules/imagenet_sgd_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='SimSiam',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=2048,
+ num_layers=3,
+ with_last_bn_affine=False,
+ with_avg_pool=True),
+ head=dict(
+ type='LatentPredictHead',
+ loss=dict(type='CosineSimilarityLoss'),
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=512,
+ out_channels=2048,
+ with_avg_pool=False,
+ with_last_bn=False,
+ with_last_bias=True)),
+)
+
+# optimizer
+# set base learning rate
+lr = 0.05
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=lr, weight_decay=1e-4, momentum=0.9),
+ paramwise_cfg=dict(custom_keys={'predictor': dict(fix_lr=True)}))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=100, by_epoch=True, begin=0, end=100)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# additional hooks
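+# SimSiamHook keeps the predictor's learning rate fixed at the base value
+# throughout training (fix_pred_lr), as in the original SimSiam recipe.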
+custom_hooks = [
+ dict(type='SimSiamHook', priority='HIGH', fix_pred_lr=True, lr=lr)
+]
diff --git a/configs/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k.py b/configs/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fa3b2bbf5eb0b2f6c9b6907e78d189c13ea00cae
--- /dev/null
+++ b/configs/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k.py
@@ -0,0 +1,52 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_mocov2.py',
+ '../_base_/schedules/imagenet_sgd_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='SimSiam',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=2048,
+ num_layers=3,
+ with_last_bn_affine=False,
+ with_avg_pool=True),
+ head=dict(
+ type='LatentPredictHead',
+ loss=dict(type='CosineSimilarityLoss'),
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=512,
+ out_channels=2048,
+ with_avg_pool=False,
+ with_last_bn=False,
+ with_last_bias=True)),
+)
+
+# optimizer
+# set base learning rate
+lr = 0.05
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=lr, weight_decay=1e-4, momentum=0.9),
+ paramwise_cfg=dict(custom_keys={'predictor': dict(fix_lr=True)}))
+
+# runtime settings
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# additional hooks
+custom_hooks = [
+ dict(type='SimSiamHook', priority='HIGH', fix_pred_lr=True, lr=lr)
+]
diff --git a/configs/spark/README.md b/configs/spark/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..60f510e959dacac9fa48a5e0495be63e4fc1a03a
--- /dev/null
+++ b/configs/spark/README.md
@@ -0,0 +1,87 @@
+# SparK
+
+> [Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling](https://arxiv.org/abs/2301.03580)
+
+
+
+## Abstract
+
+We identify and overcome two key obstacles in extending the success of BERT-style pre-training, or the masked image modeling, to convolutional networks (convnets): (i) convolution operation cannot handle irregular, random-masked input images; (ii) the single-scale nature of BERT pre-training is inconsistent with convnet's hierarchical structure. For (i), we treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode. This is the first use of sparse convolution for 2D masked modeling. For (ii), we develop a hierarchical decoder to reconstruct images from multi-scale encoded features. Our method called Sparse masKed modeling (SparK) is general: it can be used directly on any convolutional model without backbone modifications. We validate it on both classical (ResNet) and modern (ConvNeXt) models: on three downstream tasks, it surpasses both state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins (around +1.0%). Improvements on object detection and instance segmentation are more substantial (up to +3.5%), verifying the strong transferability of features learned. We also find its favorable scaling behavior by observing more gains on larger models. All this evidence reveals a promising future of generative pre-training on convnets. Codes and models are released at https://github.com/keyu-tian/SparK.
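+
+The core idea, masking at the resolution of the final feature map and letting the sparse encoder compute only on the visible patches, can be illustrated with a minimal PyTorch sketch (hypothetical code, not the implementation in this repo; the input size, 32x downsampling factor and `mask_ratio=0.6` mirror the configs below):
+
+```python
+import torch
+
+
+def build_patch_mask(input_size=224, downsample_ratio=32, mask_ratio=0.6):
+    """Draw a random patch mask on the final feature map, then upsample it to pixels."""
+    fmap = input_size // downsample_ratio                # 224 // 32 = 7
+    num_visible = round(fmap * fmap * (1 - mask_ratio))  # patches kept visible
+    keep = torch.randperm(fmap * fmap)[:num_visible]
+    mask = torch.zeros(fmap * fmap, dtype=torch.bool)
+    mask[keep] = True                                    # True = visible patch
+    mask = mask.view(1, 1, fmap, fmap)
+    return mask.repeat_interleave(downsample_ratio, dim=2).repeat_interleave(
+        downsample_ratio, dim=3)
+
+
+img = torch.rand(1, 3, 224, 224)
+mask = build_patch_mask()
+visible = img * mask              # a sparse encoder only computes on these pixels
+print(mask.float().mean())        # about 0.4 of the image stays visible
+```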
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_spark-pre_300e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('spark_sparse-resnet50_800e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/resnet50_8xb256-coslr-300e_in1k/resnet50_8xb256-coslr-300e_in1k_20230612-f86aab51.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------------- | :--------: | :-------: | :-------------------------------------------------------------------: | :----------------------------------------------------------------------: |
+| `spark_sparse-resnet50_800e_in1k` | 37.97 | 4.10 | [config](spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k_20230612-e403c28f.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k_20230612-e403c28f.json) |
+| `spark_sparse-convnextv2-tiny_800e_in1k` | 39.73 | 4.47 | [config](spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k_20230612-b0ea712e.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k_20230612-b0ea712e.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------ | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :-----------------------------------------: |
+| `resnet50_spark-pre_300e_in1k` | [SPARK](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k_20230612-e403c28f.pth) | 23.52 | 1.31 | 80.10 | 94.90 | [config](benchmarks/resnet50_8xb256-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/resnet50_8xb256-coslr-300e_in1k/resnet50_8xb256-coslr-300e_in1k_20230612-f86aab51.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/resnet50_8xb256-coslr-300e_in1k/resnet50_8xb256-coslr-300e_in1k_20230612-f86aab51.json) |
+| `convnextv2-tiny_spark-pre_300e_in1k` | [SPARK](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k_20230612-b0ea712e.pth) | 28.64 | 4.47 | 82.80 | 96.30 | [config](benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/spark//spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k_20230612-ffc78743.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/spark//spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k_20230612-ffc78743.json) |
+
+## Citation
+
+```bibtex
+@article{tian2023designing,
+ author = {Keyu Tian and Yi Jiang and Qishuai Diao and Chen Lin and Liwei Wang and Zehuan Yuan},
+ title = {Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling},
+ journal = {arXiv:2301.03580},
+ year = {2023},
+}
+```
diff --git a/configs/spark/benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py b/configs/spark/benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..95ef81f16a8d1173702ccfe3313f1e85bdd561ef
--- /dev/null
+++ b/configs/spark/benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py
@@ -0,0 +1,122 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/default_runtime.py',
+]
+
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='NumpyToPIL', to_rgb=True),
+ dict(
+ type='torchvision/TrivialAugmentWide',
+ num_magnitude_bins=31,
+ interpolation='bicubic',
+ fill=None),
+ dict(type='PILToNumpy', to_bgr=True),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ dataset=dict(pipeline=train_pipeline),
+ sampler=dict(type='RepeatAugSampler', shuffle=True),
+)
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='tiny',
+ drop_path_rate=0.1,
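+        # zero layer scale and use_grn=True switch this ConvNeXt backbone to ConvNeXt V2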
+ layer_scale_init_value=0.,
+ use_grn=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=.02, bias=0.),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ momentum=1e-4,
+ evaluate_on_origin=True,
+ priority='ABOVE_NORMAL')
+]
+
+# schedule settings
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=3.2e-3, betas=(0.9, 0.999), weight_decay=0.05),
+ constructor='LearningRateDecayOptimWrapperConstructor',
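+    # layer-wise lr decay: each shallower layer has its lr scaled by a further 0.7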
+ paramwise_cfg=dict(
+ layer_decay_rate=0.7,
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=280,
+ eta_min=1.0e-5,
+ by_epoch=True,
+ begin=20,
+ end=300)
+]
+train_cfg = dict(by_epoch=True, max_epochs=300)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ # only keeps the latest 2 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py b/configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d7527ce2a545949a6395d847631b5c4484af398
--- /dev/null
+++ b/configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py
@@ -0,0 +1,107 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs256_rsb_a12.py',
+ '../../_base_/default_runtime.py'
+]
+# the modifications below are based on the ResNet Strikes Back (RSB) settings
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='NumpyToPIL', to_rgb=True),
+ dict(
+ type='torchvision/TrivialAugmentWide',
+ num_magnitude_bins=31,
+ interpolation='bicubic',
+ fill=None),
+ dict(type='PILToNumpy', to_bgr=True),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+
+# model settings
+model = dict(
+ backbone=dict(
+ norm_cfg=dict(type='SyncBN', requires_grad=True),
+ drop_path_rate=0.05,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ head=dict(
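+        # use_sigmoid=True applies the loss in a BCE style, as in the RSB recipes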
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, use_sigmoid=True)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.1),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# schedule settings
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(
+ type='Lamb',
+ lr=0.016,
+ weight_decay=0.02,
+ ),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.7,
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1.0e-6,
+ by_epoch=True,
+ begin=5,
+ end=300)
+]
+train_cfg = dict(by_epoch=True, max_epochs=300)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ # only keeps the latest 2 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+# randomness
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/spark/metafile.yml b/configs/spark/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..81ca3a7033e7eeac1ef88a852613f4866854f625
--- /dev/null
+++ b/configs/spark/metafile.yml
@@ -0,0 +1,73 @@
+Collections:
+ - Name: SparK
+ Metadata:
+ Architecture:
+ - Dense Connections
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ Paper:
+ Title: 'Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling'
+ URL: https://arxiv.org/abs/2301.03580
+ README: configs/spark/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: spark_sparse-resnet50_800e_in1k
+ Metadata:
+ FLOPs: 4100000000
+ Parameters: 37971000
+ Training Data:
+ - ImageNet-1k
+ In Collection: SparK
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k_20230612-e403c28f.pth
+ Config: configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py
+ Downstream:
+ - resnet50_spark-pre_300e_in1k
+ - Name: resnet50_spark-pre_300e_in1k
+ Metadata:
+ FLOPs: 1310000000
+ Parameters: 23520000
+ Training Data:
+ - ImageNet-1k
+ In Collection: SparK
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.1
+ Top 5 Accuracy: 94.9
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/resnet50_8xb256-coslr-300e_in1k/resnet50_8xb256-coslr-300e_in1k_20230612-f86aab51.pth
+ Config: configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py
+
+ - Name: spark_sparse-convnextv2-tiny_800e_in1k
+ Metadata:
+ FLOPs: 4470000000
+ Parameters: 39732000
+ Training Data:
+ - ImageNet-1k
+ In Collection: SparK
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k_20230612-b0ea712e.pth
+ Config: configs/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py
+ Downstream:
+ - convnextv2-tiny_spark-pre_300e_in1k
+ - Name: convnextv2-tiny_spark-pre_300e_in1k
+ Metadata:
+ FLOPs: 4469631744
+ Parameters: 28635496
+ Training Data:
+ - ImageNet-1k
+ In Collection: SparK
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.8
+ Top 5 Accuracy: 96.3
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/spark//spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k_20230612-ffc78743.pth
+ Config: configs/spark/benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py
diff --git a/configs/spark/spark_sparse-convnext-small_16xb256-amp-coslr-800e_in1k.py b/configs/spark/spark_sparse-convnext-small_16xb256-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5cefb5b93ae8bd79e501b2c6ab6b874c11751b44
--- /dev/null
+++ b/configs/spark/spark_sparse-convnext-small_16xb256-amp-coslr-800e_in1k.py
@@ -0,0 +1,81 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset 16 x 256
+train_dataloader = dict(batch_size=256, num_workers=8)
+
+# model settings
+model = dict(
+ type='SparK',
+ input_size=224,
+ downsample_raito=32,
+ mask_ratio=0.6,
+ enc_dec_norm_cfg=dict(type='SparseLN2d', eps=1e-6),
+ enc_dec_norm_dim=768,
+ backbone=dict(
+ type='SparseConvNeXt',
+ arch='small',
+ drop_path_rate=0.2,
+ out_indices=(0, 1, 2, 3),
+ gap_before_output=False),
+ neck=dict(
+ type='SparKLightDecoder',
+ feature_dim=512,
+ upsample_ratio=32, # equal to downsample_raito
+ mid_channels=0,
+ last_act=False),
+ head=dict(
+ type='SparKPretrainHead',
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')))
+
+# optimizer wrapper
+optimizer = dict(
+ type='Lamb', lr=2e-4 * 4096 / 512, betas=(0.9, 0.95), weight_decay=0.04)
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=optimizer,
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0,
+ custom_keys={
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate and weight decay schedulers
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingWeightDecay',
+ eta_min=0.2,
+ T_max=800,
+ by_epoch=True,
+ begin=0,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ logger=dict(type='LoggerHook', interval=100),
+    # only keeps the latest 2 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+
+# randomness
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py b/configs/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3a1afc80821abb06fcafe956d1e3c3b919ab0f20
--- /dev/null
+++ b/configs/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py
@@ -0,0 +1,84 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset 16 x 256
+train_dataloader = dict(batch_size=256, num_workers=8)
+
+# model settings, use ConvNeXt V2
+model = dict(
+ type='SparK',
+ input_size=224,
+ downsample_raito=32,
+ mask_ratio=0.6,
+ enc_dec_norm_cfg=dict(type='SparseLN2d', eps=1e-6),
+ enc_dec_norm_dim=768,
+ backbone=dict(
+ type='SparseConvNeXt',
+ arch='tiny',
+ drop_path_rate=0.2,
+ out_indices=(0, 1, 2, 3),
+ gap_before_output=False,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ neck=dict(
+ type='SparKLightDecoder',
+ feature_dim=512,
+ upsample_ratio=32, # equal to downsample_raito
+ mid_channels=0,
+ last_act=False),
+ head=dict(
+ type='SparKPretrainHead',
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')))
+
+# optimizer wrapper
+optimizer = dict(
+ type='Lamb', lr=2e-4 * 4096 / 512, betas=(0.9, 0.95), weight_decay=0.04)
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=optimizer,
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0,
+ custom_keys={
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate and weight decay schedulers
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=780,
+ by_epoch=True,
+ begin=20,
+ end=800,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingWeightDecay',
+ eta_min=0.2,
+ T_max=800,
+ by_epoch=True,
+ begin=0,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ logger=dict(type='LoggerHook', interval=100),
+    # only keeps the latest 2 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+
+# randomness
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-1600e_in1k.py b/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..10fc67574b705d2181f74db3d9d839a1812731e1
--- /dev/null
+++ b/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,30 @@
+_base_ = 'spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py'
+
+# learning rate and weight decay schedulers
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingWeightDecay',
+ eta_min=0.2,
+ T_max=1600,
+ by_epoch=True,
+ begin=0,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(max_epochs=1600)
diff --git a/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py b/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..864f616209361ba63158f64d66ffb06c2693e9e8
--- /dev/null
+++ b/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,80 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset 8 x 512
+train_dataloader = dict(batch_size=512, num_workers=8)
+
+# model settings
+model = dict(
+ type='SparK',
+ input_size=224,
+ downsample_raito=32,
+ mask_ratio=0.6,
+ enc_dec_norm_cfg=dict(type='SparseSyncBatchNorm2d'),
+ enc_dec_norm_dim=2048,
+ backbone=dict(
+ type='SparseResNet',
+ depth=50,
+ out_indices=(0, 1, 2, 3),
+ drop_path_rate=0.05),
+ neck=dict(
+ type='SparKLightDecoder',
+ feature_dim=512,
+ upsample_ratio=32, # equal to downsample_raito
+ mid_channels=0,
+ last_act=False),
+ head=dict(
+ type='SparKPretrainHead',
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')))
+
+# optimizer wrapper
+optimizer = dict(
+ type='Lamb', lr=2e-4 * 4096 / 512, betas=(0.9, 0.95), weight_decay=0.04)
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=optimizer,
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0,
+ custom_keys={
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate and weight decay schedulers
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingWeightDecay',
+ eta_min=0.2,
+ T_max=800,
+ by_epoch=True,
+ begin=0,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ logger=dict(type='LoggerHook', interval=100),
+    # only keeps the latest 2 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+
+# randomness
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/swav/README.md b/configs/swav/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..fdcdfeb25e3c454d084bbf2d8a7b3d685c35c9fc
--- /dev/null
+++ b/configs/swav/README.md
@@ -0,0 +1,85 @@
+# SwAV
+
+> [Unsupervised Learning of Visual Features by Contrasting Cluster Assignments](https://arxiv.org/abs/2006.09882)
+
+
+
+## Abstract
+
+Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or “views”) of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a “swapped” prediction mechanism where we predict the code of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements.
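+
+The swapped prediction mechanism can be summarized with a short, hypothetical PyTorch sketch (this is not the `SwAVLoss` implementation in this repo; the Sinkhorn-Knopp step that normally produces the codes is replaced by placeholder soft assignments, and the temperature follows the config below):
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def swapped_prediction_loss(scores_1, scores_2, codes_1, codes_2, temperature=0.1):
+    """The code of one view supervises the prototype scores of the other view."""
+    log_p1 = F.log_softmax(scores_1 / temperature, dim=1)
+    log_p2 = F.log_softmax(scores_2 / temperature, dim=1)
+    return -0.5 * ((codes_2 * log_p1).sum(dim=1).mean() +
+                   (codes_1 * log_p2).sum(dim=1).mean())
+
+
+# toy tensors: 4 samples, 10 prototypes; real codes come from Sinkhorn-Knopp
+scores_1, scores_2 = torch.randn(4, 10), torch.randn(4, 10)
+codes_1 = F.softmax(torch.randn(4, 10), dim=1)
+codes_2 = F.softmax(torch.randn(4, 10), dim=1)
+print(swapped_prediction_loss(scores_1, scores_2, codes_1, codes_2))
+```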
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_swav-pre_8xb32-linear-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-80341e08.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :----------------------------------------------------- | :--------: | :-------: | :------------------------------------------------------------: | :---------------------------------------------------------------: |
+| `swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px` | 28.35 | 4.11 | [config](swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96_20220825-5b3fc7fc.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96_20220825-5b3fc7fc.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_swav-pre_8xb32-linear-coslr-100e_in1k` | [SWAV](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96_20220825-5b3fc7fc.pth) | 25.56 | 4.11 | 70.50 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-80341e08.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-80341e08.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{caron2020unsupervised,
+ title={Unsupervised Learning of Visual Features by Contrasting Cluster Assignments},
+ author={Caron, Mathilde and Misra, Ishan and Mairal, Julien and Goyal, Priya and Bojanowski, Piotr and Joulin, Armand},
+ booktitle={NeurIPS},
+ year={2020}
+}
+```
diff --git a/configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py b/configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2b5074c082b8b6fb36bd3c6711b60bab6394b4ce
--- /dev/null
+++ b/configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
@@ -0,0 +1,18 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_lars_coslr_90e.py',
+ '../../_base_/default_runtime.py',
+]
+
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# dataset summary
+train_dataloader = dict(batch_size=512)
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/swav/metafile.yml b/configs/swav/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..5bc1252ad1ed6528d28847b728b85f3e91e7d0b9
--- /dev/null
+++ b/configs/swav/metafile.yml
@@ -0,0 +1,44 @@
+Collections:
+ - Name: SwAV
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - LARS
+ Training Resources: 8x V100 GPUs
+ Architecture:
+ - ResNet
+ - SwAV
+ Paper:
+ Title: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
+ URL: https://arxiv.org/abs/2006.09882
+ README: configs/swav/README.md
+
+Models:
+ - Name: swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px
+ Metadata:
+ Epochs: 200
+ Batch Size: 256
+ FLOPs: 4109364224
+ Parameters: 28354752
+ Training Data: ImageNet-1k
+ In Collection: SwAV
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96_20220825-5b3fc7fc.pth
+ Config: configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py
+ Downstream:
+ - resnet50_swav-pre_8xb32-linear-coslr-100e_in1k
+ - Name: resnet50_swav-pre_8xb32-linear-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 256
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: SwAV
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 70.5
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-80341e08.pth
+ Config: configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
diff --git a/configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py b/configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py
new file mode 100644
index 0000000000000000000000000000000000000000..ebb9ead92ef84387aa8715c013be36eebb661dd8
--- /dev/null
+++ b/configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py
@@ -0,0 +1,159 @@
+_base_ = [
+ '../_base_/schedules/imagenet_lars_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+num_crops = [2, 6]
+color_distort_strength = 1.0
+view_pipeline1 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.14, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.8 * color_distort_strength,
+ contrast=0.8 * color_distort_strength,
+ saturation=0.8 * color_distort_strength,
+ hue=0.2 * color_distort_strength)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.5),
+ dict(type='RandomFlip', prob=0.5),
+]
+view_pipeline2 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=96,
+ crop_ratio_range=(0.05, 0.14),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.8 * color_distort_strength,
+ contrast=0.8 * color_distort_strength,
+ saturation=0.8 * color_distort_strength,
+ hue=0.2 * color_distort_strength)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.5),
+ dict(type='RandomFlip', prob=0.5),
+]
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='MultiView',
+ num_views=num_crops,
+ transforms=[view_pipeline1, view_pipeline2]),
+ dict(type='PackInputs')
+]
+
+batch_size = 32
+train_dataloader = dict(
+ batch_size=batch_size,
+ num_workers=8,
+ drop_last=True,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
+
+# model settings
+model = dict(
+ type='SwAV',
+ data_preprocessor=dict(
+ mean=(123.675, 116.28, 103.53),
+ std=(58.395, 57.12, 57.375),
+ to_rgb=True),
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='SwAVNeck',
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=128,
+ with_avg_pool=True),
+ head=dict(
+ type='SwAVHead',
+ loss=dict(
+ type='SwAVLoss',
+ feat_dim=128, # equal to neck['out_channels']
+ epsilon=0.05,
+ temperature=0.1,
+ num_crops=num_crops,
+ )))
+
+# optimizer
+optim_wrapper = dict(type='OptimWrapper', optimizer=dict(type='LARS', lr=0.6))
+find_unused_parameters = True
+
+# learning policy
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ T_max=200,
+ eta_min=6e-4,
+ by_epoch=True,
+ begin=0,
+ end=200,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# additional hooks
+custom_hooks = [
+ dict(
+ type='SwAVHook',
+ priority='VERY_HIGH',
+ batch_size=batch_size,
+ epoch_queue_starts=15,
+ crops_for_assign=[0, 1],
+ feat_dim=128,
+ queue_length=3840,
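+        # keep the prototype layer frozen for the first 5005 iterations,
+        # roughly one ImageNet epoch at the total batch size of 256 (8 x 32)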
+ frozen_layers_cfg=dict(prototypes=5005))
+]
diff --git a/configs/swin_transformer/README.md b/configs/swin_transformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1d41f13a52554d7dd5896d284cd22b47b6b1fc8a
--- /dev/null
+++ b/configs/swin_transformer/README.md
@@ -0,0 +1,111 @@
+# Swin-Transformer
+
+> [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030)
+
+
+
+## Introduction
+
+**Swin Transformer** (the name **Swin** stands for **S**hifted **win**dow) was first described in [the paper](https://arxiv.org/pdf/2103.14030.pdf) and capably serves as a general-purpose backbone for computer vision. It is essentially a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while still allowing for cross-window connections.
+
+Swin Transformer achieves strong performance on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val), surpassing previous models by a large margin.
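+
+The window partitioning behind this scheme can be sketched in a few lines of PyTorch (a simplified illustration, not the backbone code in this repo; the tensor shape assumes the Swin-T stage-1 setting of a 56x56 feature map with 96 channels and window size 7):
+
+```python
+import torch
+
+
+def window_partition(x, window_size):
+    """Split a feature map of shape (B, H, W, C) into non-overlapping windows."""
+    B, H, W, C = x.shape
+    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
+    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)
+
+
+# Shifted windows: roll the feature map by half a window before partitioning,
+# so the next attention layer mixes information across window borders.
+feat = torch.rand(1, 56, 56, 96)
+shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))  # shift = window_size // 2
+windows = window_partition(shifted, window_size=7)
+print(windows.shape)  # torch.Size([64, 7, 7, 96]): 64 local windows for self-attention
+```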
+
+
+

+
+
+## Abstract
+
+
+
+Show the paper's abstract
+
+
+This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with **Shifted windows**. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('swin-tiny_16xb64_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('swin-tiny_16xb64_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/swin_transformer/swin-tiny_16xb64_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/swin_transformer/swin-tiny_16xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :------------------------------------------------------------------: |
+| `swin-tiny_16xb64_in1k` | From scratch | 28.29 | 4.36 | 81.18 | 95.61 | [config](swin-tiny_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925.json) |
+| `swin-small_16xb64_in1k` | From scratch | 49.61 | 8.52 | 83.02 | 96.29 | [config](swin-small_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219-7f9d988b.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219.json) |
+| `swin-base_16xb64_in1k` | From scratch | 87.77 | 15.14 | 83.36 | 96.44 | [config](swin-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742-93230b0d.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742.json) |
+| `swin-tiny_3rdparty_in1k`\* | From scratch | 28.29 | 4.36 | 81.18 | 95.52 | [config](swin-tiny_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_tiny_patch4_window7_224-160bb0a5.pth) |
+| `swin-small_3rdparty_in1k`\* | From scratch | 49.61 | 8.52 | 83.21 | 96.25 | [config](swin-small_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_small_patch4_window7_224-cc7a01c9.pth) |
+| `swin-base_3rdparty_in1k`\* | From scratch | 87.77 | 15.14 | 83.42 | 96.44 | [config](swin-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window7_224-4670dd19.pth) |
+| `swin-base_3rdparty_in1k-384`\* | From scratch | 87.90 | 44.49 | 84.49 | 96.95 | [config](swin-base_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window12_384-02c598a4.pth) |
+| `swin-base_in21k-pre-3rdparty_in1k`\* | From scratch | 87.77 | 15.14 | 85.16 | 97.50 | [config](swin-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window7_224_22kto1k-f967f799.pth) |
+| `swin-base_in21k-pre-3rdparty_in1k-384`\* | From scratch | 87.90 | 44.49 | 86.44 | 98.05 | [config](swin-base_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window12_384_22kto1k-d59b0d1d.pth) |
+| `swin-large_in21k-pre-3rdparty_in1k`\* | From scratch | 196.53 | 34.04 | 86.24 | 97.88 | [config](swin-large_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_large_patch4_window7_224_22kto1k-5f0996db.pth) |
+| `swin-large_in21k-pre-3rdparty_in1k-384`\* | From scratch | 196.74 | 100.04 | 87.25 | 98.25 | [config](swin-large_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_large_patch4_window12_384_22kto1k-0a40944b.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Image Classification on CUB-200-2011
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :-------------------------- | :----------: | :--------: | :-------: | :-------: | :------------------------------------: | :---------------------------------------------------------------------------------------------: |
+| `swin-large_8xb8_cub-384px` | From scratch | 195.51 | 100.04 | 91.87 | [config](swin-large_8xb8_cub-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin-large_8xb8_cub_384px_20220307-1bbaee6a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin-large_8xb8_cub_384px_20220307-1bbaee6a.json) |
+
+## Citation
+
+```bibtex
+@article{liu2021Swin,
+ title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
+ author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
+ journal={arXiv preprint arXiv:2103.14030},
+ year={2021}
+}
+```
diff --git a/configs/swin_transformer/metafile.yml b/configs/swin_transformer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..8bff599267afe52a0904c106be4fcd8c76f6e4bf
--- /dev/null
+++ b/configs/swin_transformer/metafile.yml
@@ -0,0 +1,201 @@
+Collections:
+ - Name: Swin-Transformer
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ - Weight Decay
+ Training Resources: 16x V100 GPUs
+ Epochs: 300
+ Batch Size: 1024
+ Architecture:
+ - Shift Window Multihead Self Attention
+ Paper:
+ URL: https://arxiv.org/abs/2103.14030
+ Title: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"
+ README: configs/swin_transformer/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/swin_transformer.py#L176
+ Version: v0.15.0
+
+Models:
+ - Name: swin-tiny_16xb64_in1k
+ Metadata:
+ FLOPs: 4360000000
+ Parameters: 28290000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.18
+ Top 5 Accuracy: 95.61
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth
+ Config: configs/swin_transformer/swin-tiny_16xb64_in1k.py
+ - Name: swin-small_16xb64_in1k
+ Metadata:
+ FLOPs: 8520000000
+ Parameters: 49610000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.02
+ Top 5 Accuracy: 96.29
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219-7f9d988b.pth
+ Config: configs/swin_transformer/swin-small_16xb64_in1k.py
+ - Name: swin-base_16xb64_in1k
+ Metadata:
+ FLOPs: 15140000000
+ Parameters: 87770000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.36
+ Top 5 Accuracy: 96.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742-93230b0d.pth
+ Config: configs/swin_transformer/swin-base_16xb64_in1k.py
+ - Name: swin-tiny_3rdparty_in1k
+ Metadata:
+ FLOPs: 4360000000
+ Parameters: 28290000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.18
+ Top 5 Accuracy: 95.52
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_tiny_patch4_window7_224-160bb0a5.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-tiny_16xb64_in1k.py
+ - Name: swin-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 8520000000
+ Parameters: 49610000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.21
+ Top 5 Accuracy: 96.25
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_small_patch4_window7_224-cc7a01c9.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_small_patch4_window7_224.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-small_16xb64_in1k.py
+ - Name: swin-base_3rdparty_in1k
+ Metadata:
+ FLOPs: 15140000000
+ Parameters: 87770000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.42
+ Top 5 Accuracy: 96.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window7_224-4670dd19.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window7_224.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-base_16xb64_in1k.py
+ - Name: swin-base_3rdparty_in1k-384
+ Metadata:
+ FLOPs: 44490000000
+ Parameters: 87900000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.49
+ Top 5 Accuracy: 96.95
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window12_384-02c598a4.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window12_384.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-base_16xb64_in1k-384px.py
+ - Name: swin-base_in21k-pre-3rdparty_in1k
+ Metadata:
+ FLOPs: 15140000000
+ Parameters: 87770000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.16
+ Top 5 Accuracy: 97.50
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window7_224_22kto1k-f967f799.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window7_224_22kto1k.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-base_16xb64_in1k.py
+ - Name: swin-base_in21k-pre-3rdparty_in1k-384
+ Metadata:
+ FLOPs: 44490000000
+ Parameters: 87900000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.44
+ Top 5 Accuracy: 98.05
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window12_384_22kto1k-d59b0d1d.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window12_384_22kto1k.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-base_16xb64_in1k-384px.py
+ - Name: swin-large_in21k-pre-3rdparty_in1k
+ Metadata:
+ FLOPs: 34040000000
+ Parameters: 196530000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.24
+ Top 5 Accuracy: 97.88
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_large_patch4_window7_224_22kto1k-5f0996db.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window7_224_22kto1k.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-large_16xb64_in1k.py
+ - Name: swin-large_in21k-pre-3rdparty_in1k-384
+ Metadata:
+ FLOPs: 100040000000
+ Parameters: 196740000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.25
+ Top 5 Accuracy: 98.25
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_large_patch4_window12_384_22kto1k-0a40944b.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window12_384_22kto1k.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-large_16xb64_in1k-384px.py
+ - Name: swin-large_8xb8_cub-384px
+ Metadata:
+ FLOPs: 100040000000
+ Parameters: 195510000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: CUB-200-2011
+ Metrics:
+ Top 1 Accuracy: 91.87
+ Task: Image Classification
+ Pretrain: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin-large_3rdparty_in21k-384px.pth
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin-large_8xb8_cub_384px_20220307-1bbaee6a.pth
+ Config: configs/swin_transformer/swin-large_8xb8_cub-384px.py
diff --git a/configs/swin_transformer/swin-base_16xb64_in1k-384px.py b/configs/swin_transformer/swin-base_16xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..10f89921ff1ec6659509ccdee8e15cfe52395880
--- /dev/null
+++ b/configs/swin_transformer/swin-base_16xb64_in1k-384px.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/swin_transformer/base_384.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/swin_transformer/swin-base_16xb64_in1k.py b/configs/swin_transformer/swin-base_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..05a95b4483dd3764abbcf9e32b1291334e084099
--- /dev/null
+++ b/configs/swin_transformer/swin-base_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/swin_transformer/base_224.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/swin_transformer/swin-large_16xb64_in1k-384px.py b/configs/swin_transformer/swin-large_16xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..5ba52b3564704acfeb2c40eb39e1d4e5cf5bf573
--- /dev/null
+++ b/configs/swin_transformer/swin-large_16xb64_in1k-384px.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/swin_transformer/large_384.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/swin_transformer/swin-large_16xb64_in1k.py b/configs/swin_transformer/swin-large_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..36121efca15f951a03d153b614d3e844cc8cad26
--- /dev/null
+++ b/configs/swin_transformer/swin-large_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/swin_transformer/large_224.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/swin_transformer/swin-large_8xb8_cub-384px.py b/configs/swin_transformer/swin-large_8xb8_cub-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..a2f10a6a292bc2485085a38c895b635a5944d04c
--- /dev/null
+++ b/configs/swin_transformer/swin-large_8xb8_cub-384px.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/swin_transformer/large_384.py',
+ '../_base_/datasets/cub_bs8_384.py',
+ '../_base_/schedules/cub_bs64.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+checkpoint = 'https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin-large_3rdparty_in21k-384px.pth' # noqa
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ init_cfg=dict(
+ type='Pretrained', checkpoint=checkpoint, prefix='backbone')),
+ head=dict(num_classes=200, ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ _delete_=True,
+ type='AdamW',
+ lr=5e-6,
+ weight_decay=0.0005,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.absolute_pos_embed': dict(decay_mult=0.0),
+ '.relative_position_bias_table': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
+
+default_hooks = dict(
+    # log every 20 iterations
+ logger=dict(type='LoggerHook', interval=20),
+ # save last three checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
diff --git a/configs/swin_transformer/swin-small_16xb64_in1k.py b/configs/swin_transformer/swin-small_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7c1a8e21a7f2cbc881cbde43c19af9cd10b7c2ba
--- /dev/null
+++ b/configs/swin_transformer/swin-small_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/swin_transformer/small_224.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/swin_transformer/swin-tiny_16xb64_in1k.py b/configs/swin_transformer/swin-tiny_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9a1ce2508ab603b008640583de78c64d2f178620
--- /dev/null
+++ b/configs/swin_transformer/swin-tiny_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/swin_transformer/tiny_224.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/swin_transformer_v2/README.md b/configs/swin_transformer_v2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..dd20548ae780ebca6cf0cc982ea71c782e369b52
--- /dev/null
+++ b/configs/swin_transformer_v2/README.md
@@ -0,0 +1,121 @@
+# Swin-Transformer V2
+
+> [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883)
+
+
+
+## Introduction
+
+**Swin Transformer V2** is a work on scaling up vision models, built on [Swin Transformer](https://github.com/open-mmlab/mmpretrain/tree/main/configs/swin_transformer). In the vision field, performance cannot be improved by simply scaling up the model as is done for NLP models. The possible reasons mentioned in the paper are:
+
+- Training instability when scaling up the model size
+- Difficulty in transferring a model trained at low resolution to tasks with higher resolution
+- Excessive GPU memory consumption
+
+To address these issues, the paper proposes the following improvements:
+
+- Post normalization: apply layer normalization after the self-attention layer and the MLP block
+- Scaled cosine attention: use cosine similarity to compute the attention between token pairs (sketched below)
+- Log-spaced continuous position bias: redefine the relative position encoding
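+
+For reference, below is a minimal PyTorch sketch of the scaled cosine attention idea. It is an illustration only, not the MMPreTrain implementation; the tensor shapes and the clamp value are assumptions.
+
+```python
+import math
+
+import torch
+import torch.nn.functional as F
+
+
+def scaled_cosine_attention(q, k, v, logit_scale):
+    """Attention from cosine similarity instead of scaled dot products.
+
+    q, k, v: (batch, num_heads, num_tokens, head_dim)
+    logit_scale: learnable per-head log-temperature, clamped for stability.
+    """
+    # Cosine similarity between every query/key pair.
+    attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
+    # Learnable, clamped temperature (the clamp value is an assumption).
+    scale = torch.clamp(logit_scale, max=math.log(100.0)).exp()
+    attn = (attn * scale).softmax(dim=-1)
+    return attn @ v
+
+
+q = k = v = torch.rand(1, 4, 64, 32)
+logit_scale = torch.nn.Parameter(torch.log(10 * torch.ones(4, 1, 1)))
+print(scaled_cosine_attention(q, k, v, logit_scale).shape)  # (1, 4, 64, 32)
+```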
+
+
+

+
+
+## Abstract
+
+
+
+
+
+
+
+Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google's billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('swinv2-tiny-w8_3rdparty_in1k-256px', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('swinv2-tiny-w8_3rdparty_in1k-256px', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w8_3rdparty_in1k-256px_20220803-e318968f.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :---------------------------------------- | :--------: | :-------: | :----------------------------------------------: | :------------------------------------------------------------------------------------------: |
+| `swinv2-base-w12_3rdparty_in21k-192px`\* | 87.92 | 8.51 | [config](swinv2-base-w12_8xb128_in21k-192px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-base-w12_3rdparty_in21k-192px_20220803-f7dc9763.pth) |
+| `swinv2-large-w12_3rdparty_in21k-192px`\* | 196.74 | 19.04 | [config](swinv2-large-w12_8xb128_in21k-192px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-large-w12_3rdparty_in21k-192px_20220803-d9073fee.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/Swin-Transformer). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------------: | :--------------------------------------------------: |
+| `swinv2-tiny-w8_3rdparty_in1k-256px`\* | From scratch | 28.35 | 4.35 | 81.76 | 95.87 | [config](swinv2-tiny-w8_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w8_3rdparty_in1k-256px_20220803-e318968f.pth) |
+| `swinv2-tiny-w16_3rdparty_in1k-256px`\* | From scratch | 28.35 | 4.40 | 82.81 | 96.23 | [config](swinv2-tiny-w16_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w16_3rdparty_in1k-256px_20220803-9651cdd7.pth) |
+| `swinv2-small-w8_3rdparty_in1k-256px`\* | From scratch | 49.73 | 8.45 | 83.74 | 96.60 | [config](swinv2-small-w8_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w8_3rdparty_in1k-256px_20220803-b01a4332.pth) |
+| `swinv2-small-w16_3rdparty_in1k-256px`\* | From scratch | 49.73 | 8.57 | 84.13 | 96.83 | [config](swinv2-small-w16_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w16_3rdparty_in1k-256px_20220803-b707d206.pth) |
+| `swinv2-base-w8_3rdparty_in1k-256px`\* | From scratch | 87.92 | 14.99 | 84.20 | 96.86 | [config](swinv2-base-w8_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w8_3rdparty_in1k-256px_20220803-8ff28f2b.pth) |
+| `swinv2-base-w16_3rdparty_in1k-256px`\* | From scratch | 87.92 | 15.14 | 84.60 | 97.05 | [config](swinv2-base-w16_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_3rdparty_in1k-256px_20220803-5a1886b7.pth) |
+| `swinv2-base-w16_in21k-pre_3rdparty_in1k-256px`\* | ImageNet-21k | 87.92 | 15.14 | 86.17 | 97.88 | [config](swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_in21k-pre_3rdparty_in1k-256px_20220803-8d7aa8ad.pth) |
+| `swinv2-base-w24_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 87.92 | 34.07 | 87.14 | 98.23 | [config](swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w24_in21k-pre_3rdparty_in1k-384px_20220803-44eb70f8.pth) |
+| `swinv2-large-w16_in21k-pre_3rdparty_in1k-256px`\* | ImageNet-21k | 196.75 | 33.86 | 86.93 | 98.06 | [config](swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w16_in21k-pre_3rdparty_in1k-256px_20220803-c40cbed7.pth) |
+| `swinv2-large-w24_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 196.75 | 76.20 | 87.59 | 98.27 | [config](swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w24_in21k-pre_3rdparty_in1k-384px_20220803-3b36c165.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/Swin-Transformer). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{https://doi.org/10.48550/arxiv.2111.09883,
+ doi = {10.48550/ARXIV.2111.09883},
+ url = {https://arxiv.org/abs/2111.09883},
+ author = {Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, Yue and Zhang, Zheng and Dong, Li and Wei, Furu and Guo, Baining},
+ keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
+ title = {Swin Transformer V2: Scaling Up Capacity and Resolution},
+ publisher = {arXiv},
+ year = {2021},
+ copyright = {Creative Commons Attribution 4.0 International}
+}
+```
diff --git a/configs/swin_transformer_v2/metafile.yml b/configs/swin_transformer_v2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..55a14cbab587f037d96583d3b0210ac3008b1118
--- /dev/null
+++ b/configs/swin_transformer_v2/metafile.yml
@@ -0,0 +1,206 @@
+Collections:
+ - Name: Swin-Transformer V2
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ - Weight Decay
+ Training Resources: 16x V100 GPUs
+ Epochs: 300
+ Batch Size: 1024
+ Architecture:
+ - Shift Window Multihead Self Attention
+ Paper:
+ URL: https://arxiv.org/abs/2111.09883
+ Title: "Swin Transformer V2: Scaling Up Capacity and Resolution"
+ README: configs/swin_transformer_v2/README.md
+
+Models:
+ - Name: swinv2-tiny-w8_3rdparty_in1k-256px
+ Metadata:
+ FLOPs: 4350000000
+ Parameters: 28350000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.76
+ Top 5 Accuracy: 95.87
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w8_3rdparty_in1k-256px_20220803-e318968f.pth
+ Config: configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_tiny_patch4_window8_256.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-tiny-w16_3rdparty_in1k-256px
+ Metadata:
+ FLOPs: 4400000000
+ Parameters: 28350000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.81
+ Top 5 Accuracy: 96.23
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w16_3rdparty_in1k-256px_20220803-9651cdd7.pth
+ Config: configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_tiny_patch4_window16_256.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-small-w8_3rdparty_in1k-256px
+ Metadata:
+ FLOPs: 8450000000
+ Parameters: 49730000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.74
+ Top 5 Accuracy: 96.6
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w8_3rdparty_in1k-256px_20220803-b01a4332.pth
+ Config: configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_small_patch4_window8_256.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-small-w16_3rdparty_in1k-256px
+ Metadata:
+ FLOPs: 8570000000
+ Parameters: 49730000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.13
+ Top 5 Accuracy: 96.83
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w16_3rdparty_in1k-256px_20220803-b707d206.pth
+ Config: configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_small_patch4_window16_256.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-base-w8_3rdparty_in1k-256px
+ Metadata:
+ FLOPs: 14990000000
+ Parameters: 87920000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.2
+ Top 5 Accuracy: 96.86
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w8_3rdparty_in1k-256px_20220803-8ff28f2b.pth
+ Config: configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window8_256.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-base-w16_3rdparty_in1k-256px
+ Metadata:
+ FLOPs: 15140000000
+ Parameters: 87920000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.6
+ Top 5 Accuracy: 97.05
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_3rdparty_in1k-256px_20220803-5a1886b7.pth
+ Config: configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window16_256.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-base-w16_in21k-pre_3rdparty_in1k-256px
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 15140000000
+ Parameters: 87920000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.17
+ Top 5 Accuracy: 97.88
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_in21k-pre_3rdparty_in1k-256px_20220803-8d7aa8ad.pth
+ Config: configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12to16_192to256_22kto1k_ft.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-base-w24_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 34070000000
+ Parameters: 87920000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.14
+ Top 5 Accuracy: 98.23
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w24_in21k-pre_3rdparty_in1k-384px_20220803-44eb70f8.pth
+ Config: configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12to24_192to384_22kto1k_ft.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-large-w16_in21k-pre_3rdparty_in1k-256px
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 33860000000
+ Parameters: 196750000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.93
+ Top 5 Accuracy: 98.06
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w16_in21k-pre_3rdparty_in1k-256px_20220803-c40cbed7.pth
+ Config: configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12to16_192to256_22kto1k_ft.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-large-w24_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 76200000000
+ Parameters: 196750000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.59
+ Top 5 Accuracy: 98.27
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w24_in21k-pre_3rdparty_in1k-384px_20220803-3b36c165.pth
+ Config: configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12to24_192to384_22kto1k_ft.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-base-w12_3rdparty_in21k-192px
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 8510000000
+ Parameters: 87920000
+ In Collection: Swin-Transformer V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-base-w12_3rdparty_in21k-192px_20220803-f7dc9763.pth
+ Config: configs/swin_transformer_v2/swinv2-base-w12_8xb128_in21k-192px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12_192_22k.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-large-w12_3rdparty_in21k-192px
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 19040000000
+ Parameters: 196740000
+ In Collection: Swin-Transformer V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-large-w12_3rdparty_in21k-192px_20220803-d9073fee.pth
+ Config: configs/swin_transformer_v2/swinv2-large-w12_8xb128_in21k-192px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12_192_22k.pth
+ Code: https://github.com/microsoft/Swin-Transformer
diff --git a/configs/swin_transformer_v2/swinv2-base-w12_8xb128_in21k-192px.py b/configs/swin_transformer_v2/swinv2-base-w12_8xb128_in21k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..9b01b75d296dae9db97d2d85f73463f6c87c0b1c
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-base-w12_8xb128_in21k-192px.py
@@ -0,0 +1,19 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/base_256.py',
+ '../_base_/datasets/imagenet21k_bs128.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(img_size=192, window_size=[12, 12, 12, 6]),
+ head=dict(num_classes=21841),
+)
+
+# dataset settings
+data_preprocessor = dict(num_classes=21841)
+
+_base_['train_pipeline'][1]['scale'] = 192 # RandomResizedCrop
+_base_['test_pipeline'][1]['scale'] = 219 # ResizeEdge
+_base_['test_pipeline'][2]['crop_size'] = 192 # CenterCrop
diff --git a/configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..5f375ee1fc9b10885f8b9d9f4794b8530c1460b5
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/base_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(backbone=dict(window_size=[16, 16, 16, 8]))
diff --git a/configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..0725f9e739a099551a4d5b5f007bcb83708be309
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py
@@ -0,0 +1,13 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/base_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ window_size=[16, 16, 16, 8],
+ drop_path_rate=0.2,
+ pretrained_window_sizes=[12, 12, 12, 6]))
diff --git a/configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py b/configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..3dd4e5fd935a356d29e7790e91d4538c94711062
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py
@@ -0,0 +1,14 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/base_384.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ img_size=384,
+ window_size=[24, 24, 24, 12],
+ drop_path_rate=0.2,
+ pretrained_window_sizes=[12, 12, 12, 6]))
diff --git a/configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..23fc40701470f8e41252c274072896d1cd811f28
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/base_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/swin_transformer_v2/swinv2-large-w12_8xb128_in21k-192px.py b/configs/swin_transformer_v2/swinv2-large-w12_8xb128_in21k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..9b01b75d296dae9db97d2d85f73463f6c87c0b1c
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-large-w12_8xb128_in21k-192px.py
@@ -0,0 +1,19 @@
+_base_ = [
+    '../_base_/models/swin_transformer_v2/large_256.py',
+ '../_base_/datasets/imagenet21k_bs128.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(img_size=192, window_size=[12, 12, 12, 6]),
+ head=dict(num_classes=21841),
+)
+
+# dataset settings
+data_preprocessor = dict(num_classes=21841)
+
+_base_['train_pipeline'][1]['scale'] = 192 # RandomResizedCrop
+_base_['test_pipeline'][1]['scale'] = 219 # ResizeEdge
+_base_['test_pipeline'][2]['crop_size'] = 192 # CenterCrop
diff --git a/configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..62a2a29b843f197c15d8f53a7cbd1029be675fa8
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py
@@ -0,0 +1,13 @@
+# Only for evaluation
+_base_ = [
+ '../_base_/models/swin_transformer_v2/large_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ window_size=[16, 16, 16, 8], pretrained_window_sizes=[12, 12, 12, 6]),
+)
diff --git a/configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py b/configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..d97d9b2b869c1e0c264910859b6f980387a7b6ab
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py
@@ -0,0 +1,15 @@
+# Only for evaluation
+_base_ = [
+ '../_base_/models/swin_transformer_v2/large_384.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ img_size=384,
+ window_size=[24, 24, 24, 12],
+ pretrained_window_sizes=[12, 12, 12, 6]),
+)
diff --git a/configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..f87265dd199c712a6442407db852b5d4b6aabd7d
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/small_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(backbone=dict(window_size=[16, 16, 16, 8]))
diff --git a/configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1001f1b6e1978c3706ca6183f863c316b13ade4
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/small_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..7e1f290f371e1b9084f4cd5291e1e638d0ad54e3
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/tiny_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(backbone=dict(window_size=[16, 16, 16, 8]))
diff --git a/configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..2cdc9a25ae8a64758f8642c079e1ff7fbf0548c3
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/tiny_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/t2t_vit/README.md b/configs/t2t_vit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..bf0967cf27f606788174bc9fc2198cad3dbfced6
--- /dev/null
+++ b/configs/t2t_vit/README.md
@@ -0,0 +1,81 @@
+# Tokens-to-Token ViT
+
+> [Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet](https://arxiv.org/abs/2101.11986)
+
+
+
+## Abstract
+
+Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3% top1 accuracy in image resolution 384×384 on ImageNet.
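+
+A minimal sketch of one Tokens-to-Token "soft split" step is shown below, assuming an `nn.Unfold`-based aggregation; the kernel, stride and padding values are illustrative, not the exact settings of the MMPreTrain backbone.
+
+```python
+import torch
+import torch.nn as nn
+
+
+def soft_split(tokens, h, w, kernel=3, stride=2, padding=1):
+    """Aggregate neighboring tokens into fewer, longer tokens.
+
+    tokens: (batch, num_tokens, channels) with num_tokens == h * w.
+    Returns (batch, new_num_tokens, channels * kernel * kernel).
+    """
+    b, n, c = tokens.shape
+    assert n == h * w
+    # Re-assemble tokens into an image-like feature map ...
+    feat = tokens.transpose(1, 2).reshape(b, c, h, w)
+    # ... and unfold it into overlapping patches, so that each new token
+    # aggregates a local neighborhood of the previous tokens.
+    patches = nn.Unfold(kernel_size=kernel, stride=stride, padding=padding)(feat)
+    return patches.transpose(1, 2)
+
+
+tokens = torch.rand(1, 56 * 56, 64)
+print(soft_split(tokens, 56, 56).shape)  # torch.Size([1, 784, 576])
+```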
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('t2t-vit-t-14_8xb64_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('t2t-vit-t-14_8xb64_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: |
+| `t2t-vit-t-14_8xb64_in1k` | From scratch | 21.47 | 4.34 | 81.83 | 95.84 | [config](t2t-vit-t-14_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.json) |
+| `t2t-vit-t-19_8xb64_in1k` | From scratch | 39.08 | 7.80 | 82.63 | 96.18 | [config](t2t-vit-t-19_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-19_8xb64_in1k_20211214-7f5e3aaf.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-19_8xb64_in1k_20211214-7f5e3aaf.json) |
+| `t2t-vit-t-24_8xb64_in1k` | From scratch | 64.00 | 12.69 | 82.71 | 96.09 | [config](t2t-vit-t-24_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-24_8xb64_in1k_20211214-b2a68ae3.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-24_8xb64_in1k_20211214-b2a68ae3.json) |
+
+## Citation
+
+```bibtex
+@article{yuan2021tokens,
+ title={Tokens-to-token vit: Training vision transformers from scratch on imagenet},
+ author={Yuan, Li and Chen, Yunpeng and Wang, Tao and Yu, Weihao and Shi, Yujun and Tay, Francis EH and Feng, Jiashi and Yan, Shuicheng},
+ journal={arXiv preprint arXiv:2101.11986},
+ year={2021}
+}
+```
diff --git a/configs/t2t_vit/metafile.yml b/configs/t2t_vit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..72cb2dfc92899779846af6263a125d028d17d1b2
--- /dev/null
+++ b/configs/t2t_vit/metafile.yml
@@ -0,0 +1,58 @@
+Collections:
+ - Name: Tokens-to-Token ViT
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Layer Normalization
+ - Scaled Dot-Product Attention
+ - Attention Dropout
+ - Dropout
+ - Tokens to Token
+ Paper:
+ URL: https://arxiv.org/abs/2101.11986
+ Title: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet"
+ README: configs/t2t_vit/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.17.0/mmcls/models/backbones/t2t_vit.py
+ Version: v0.17.0
+
+Models:
+ - Name: t2t-vit-t-14_8xb64_in1k
+ Metadata:
+ FLOPs: 4340000000
+ Parameters: 21470000
+ In Collection: Tokens-to-Token ViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.83
+ Top 5 Accuracy: 95.84
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.pth
+ Config: configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py
+ - Name: t2t-vit-t-19_8xb64_in1k
+ Metadata:
+ FLOPs: 7800000000
+ Parameters: 39080000
+ In Collection: Tokens-to-Token ViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.63
+ Top 5 Accuracy: 96.18
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-19_8xb64_in1k_20211214-7f5e3aaf.pth
+ Config: configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py
+ - Name: t2t-vit-t-24_8xb64_in1k
+ Metadata:
+ FLOPs: 12690000000
+ Parameters: 64000000
+ In Collection: Tokens-to-Token ViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.71
+ Top 5 Accuracy: 96.09
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-24_8xb64_in1k_20211214-b2a68ae3.pth
+ Config: configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py
diff --git a/configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py b/configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8ff6444548c4be59f52bc2aa259e7aaac32dea3d
--- /dev/null
+++ b/configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py
@@ -0,0 +1,49 @@
+_base_ = [
+ '../_base_/models/t2t-vit-t-14.py',
+ '../_base_/datasets/imagenet_bs64_t2t_224.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='AdamW', lr=5e-4, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={'cls_token': dict(decay_mult=0.0)},
+ ),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=290,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300),
+ # cool down learning rate scheduler
+ dict(type='ConstantLR', factor=0.1, by_epoch=True, begin=300, end=310),
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=310, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# runtime settings
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py b/configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0c7275372f904a4d53453b37bb50bfd31edb842f
--- /dev/null
+++ b/configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py
@@ -0,0 +1,49 @@
+_base_ = [
+ '../_base_/models/t2t-vit-t-19.py',
+ '../_base_/datasets/imagenet_bs64_t2t_224.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='AdamW', lr=5e-4, weight_decay=0.065),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={'cls_token': dict(decay_mult=0.0)},
+ ),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=290,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300),
+ # cool down learning rate scheduler
+ dict(type='ConstantLR', factor=0.1, by_epoch=True, begin=300, end=310),
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=310, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# runtime settings
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py b/configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e180ff344bd88808e635f3004704c6079a03465b
--- /dev/null
+++ b/configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py
@@ -0,0 +1,49 @@
+_base_ = [
+ '../_base_/models/t2t-vit-t-24.py',
+ '../_base_/datasets/imagenet_bs64_t2t_224.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='AdamW', lr=5e-4, weight_decay=0.065),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={'cls_token': dict(decay_mult=0.0)},
+ ),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=290,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300),
+ # cool down learning rate scheduler
+ dict(type='ConstantLR', factor=0.1, by_epoch=True, begin=300, end=310),
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=310, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# runtime settings
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/tinyvit/README.md b/configs/tinyvit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..58ceb5779b474a9818843cec0d34e8fc8f178f4b
--- /dev/null
+++ b/configs/tinyvit/README.md
@@ -0,0 +1,82 @@
+# TinyViT
+
+> [TinyViT: Fast Pretraining Distillation for Small Vision Transformers](https://arxiv.org/abs/2207.10666)
+
+
+
+## Abstract
+
+Vision transformer (ViT) recently has drawn great attention in computer vision due to its remarkable model capability. However, most prevailing ViT models suffer from huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of tiny and efficient small vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling small models to get the dividends of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored in disk in advance to save the memory cost and computation overheads. The tiny student transformers are automatically scaled down from a large pretrained model with computation and parameter constraints. Comprehensive experiments demonstrate the efficacy of TinyViT. It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, being comparable to SwinB pretrained on ImageNet-21k while using 4.2 times fewer parameters. Moreover, increasing image resolutions, TinyViT can reach 86.5% accuracy, being slightly better than Swin-L while using only 11% parameters. Last but not the least, we demonstrate a good transfer ability of TinyViT on various downstream tasks.
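+
+The pre-computed sparse teacher logits can be pictured with a small sketch like the one below. This is only an illustration of the idea, assuming top-k sparsification; the value of `k`, the function names and the storage format are not taken from the official pipeline.
+
+```python
+import torch
+
+
+def sparsify_teacher_logits(logits, k=10):
+    """Keep only the top-k teacher logits per image so that they can be
+    cached on disk once and replayed during student pretraining."""
+    values, indices = logits.topk(k, dim=-1)
+    return values, indices
+
+
+def densify(values, indices, num_classes, fill=-1e4):
+    """Rebuild full-size logits (for a distillation loss) from the cache."""
+    dense = torch.full((values.size(0), num_classes), fill)
+    return dense.scatter(1, indices, values)
+
+
+teacher_logits = torch.randn(4, 1000)                 # fake teacher output
+vals, idx = sparsify_teacher_logits(teacher_logits)   # store these two tensors
+recovered = densify(vals, idx, num_classes=1000)      # load them at training time
+print(vals.shape, idx.shape, recovered.shape)
+```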
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('tinyvit-5m_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('tinyvit-5m_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/tinyvit/tinyvit-5m_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_3rdparty_in1k_20221021-62cb5abf.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :--------------------------------------------- | :------------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------------: | :------------------------------------------------: |
+| `tinyvit-5m_3rdparty_in1k`\* | From scratch | 5.39 | 1.29 | 79.02 | 94.74 | [config](tinyvit-5m_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_3rdparty_in1k_20221021-62cb5abf.pth) |
+| `tinyvit-5m_in21k-distill-pre_3rdparty_in1k`\* | ImageNet-21k DISTILL | 5.39 | 1.29 | 80.71 | 95.57 | [config](tinyvit-5m-distill_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_in21k-distill-pre_3rdparty_in1k_20221021-d4b010a8.pth) |
+| `tinyvit-11m_3rdparty_in1k`\* | From scratch | 11.00 | 2.05 | 81.44 | 95.79 | [config](tinyvit-11m_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-11m_3rdparty_in1k_20221021-11ccef16.pth) |
+| `tinyvit-11m_in21k-distill-pre_3rdparty_in1k`\* | ImageNet-21k DISTILL | 11.00 | 2.05 | 83.19 | 96.53 | [config](tinyvit-11m-distill_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-11m_in21k-distill-pre_3rdparty_in1k_20221021-5d3bc0dc.pth) |
+| `tinyvit-21m_3rdparty_in1k`\* | From scratch | 21.20 | 4.30 | 83.08 | 96.58 | [config](tinyvit-21m_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_3rdparty_in1k_20221021-5346ba34.pth) |
+| `tinyvit-21m_in21k-distill-pre_3rdparty_in1k`\* | ImageNet-21k DISTILL | 21.20 | 4.30 | 84.85 | 97.27 | [config](tinyvit-21m-distill_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k_20221021-3d9b30a2.pth) |
+| `tinyvit-21m_in21k-distill-pre_3rdparty_in1k-384px`\* | ImageNet-21k DISTILL | 21.23 | 13.85 | 86.21 | 97.77 | [config](tinyvit-21m-distill_8xb256_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k-384px_20221021-65be6b3f.pth) |
+| `tinyvit-21m_in21k-distill-pre_3rdparty_in1k-512px`\* | ImageNet-21k DISTILL | 21.27 | 27.15 | 86.44 | 97.89 | [config](tinyvit-21m-distill_8xb256_in1k-512px.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k-512px_20221021-e42a9bea.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/Cream/tree/main/TinyViT). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@InProceedings{tiny_vit,
+ title={TinyViT: Fast Pretraining Distillation for Small Vision Transformers},
+ author={Wu, Kan and Zhang, Jinnian and Peng, Houwen and Liu, Mengchen and Xiao, Bin and Fu, Jianlong and Yuan, Lu},
+ booktitle={European conference on computer vision (ECCV)},
+ year={2022}
+}
+```
diff --git a/configs/tinyvit/metafile.yml b/configs/tinyvit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..a1c5438acb9eba87f7a5e8c02356459c1194d74a
--- /dev/null
+++ b/configs/tinyvit/metafile.yml
@@ -0,0 +1,162 @@
+Collections:
+ - Name: TinyViT
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - MBConv
+ - Window Multi-head Self-Attention
+ Paper:
+ Title: 'TinyViT: Fast Pretraining Distillation for Small Vision Transformers'
+ URL: https://arxiv.org/abs/2207.10666
+ README: configs/tinyvit/README.md
+ Code:
+ Version: v1.0.0rc1
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.23.2/mmcls/models/backbones/tinyvit.py
+
+Models:
+ - Name: tinyvit-5m_3rdparty_in1k
+ Metadata:
+ FLOPs: 1286655360
+ Parameters: 5392764
+ Training Data: ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.02
+ Top 5 Accuracy: 94.74
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_3rdparty_in1k_20221021-62cb5abf.pth
+ Config: configs/tinyvit/tinyvit-5m_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_5m_1k.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-5m_in21k-distill-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 1286655360
+ Parameters: 5392764
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.71
+ Top 5 Accuracy: 95.57
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_in21k-distill-pre_3rdparty_in1k_20221021-d4b010a8.pth
+ Config: configs/tinyvit/tinyvit-5m-distill_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_5m_22kto1k_distill.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-11m_3rdparty_in1k
+ Metadata:
+ FLOPs: 2050033664
+ Parameters: 10996972
+ Training Data: ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.44
+ Top 5 Accuracy: 95.79
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-11m_3rdparty_in1k_20221021-11ccef16.pth
+ Config: configs/tinyvit/tinyvit-11m_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_11m_1k.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-11m_in21k-distill-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 2050033664
+ Parameters: 10996972
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.19
+ Top 5 Accuracy: 96.53
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-11m_in21k-distill-pre_3rdparty_in1k_20221021-5d3bc0dc.pth
+ Config: configs/tinyvit/tinyvit-11m-distill_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_11m_22kto1k_distill.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-21m_3rdparty_in1k
+ Metadata:
+ FLOPs: 4301124096
+ Parameters: 21198568
+ Training Data: ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.08
+ Top 5 Accuracy: 96.58
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_3rdparty_in1k_20221021-5346ba34.pth
+ Config: configs/tinyvit/tinyvit-21m_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_21m_1k.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-21m_in21k-distill-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 4301124096
+ Parameters: 21198568
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.85
+ Top 5 Accuracy: 97.27
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k_20221021-3d9b30a2.pth
+ Config: configs/tinyvit/tinyvit-21m-distill_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_21m_22kto1k_distill.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-21m_in21k-distill-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 13848250176
+ Parameters: 21230488
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.21
+ Top 5 Accuracy: 97.77
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k-384px_20221021-65be6b3f.pth
+ Config: configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-384px.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_21m_22kto1k_384_distill.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-21m_in21k-distill-pre_3rdparty_in1k-512px
+ Metadata:
+ FLOPs: 27151420224
+ Parameters: 21268120
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.44
+ Top 5 Accuracy: 97.89
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k-512px_20221021-e42a9bea.pth
+ Config: configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-512px.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_21m_22kto1k_512_distill.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
diff --git a/configs/tinyvit/tinyvit-11m-distill_8xb256_in1k.py b/configs/tinyvit/tinyvit-11m-distill_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..145feb9aa65baf4bba947cdebb6e8dad5b9781f5
--- /dev/null
+++ b/configs/tinyvit/tinyvit-11m-distill_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = [
+ './tinyvit-11m_8xb256_in1k.py',
+]
diff --git a/configs/tinyvit/tinyvit-11m_8xb256_in1k.py b/configs/tinyvit/tinyvit-11m_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f3acfa86a0d5fa24aae44c01064c49f5348d7da3
--- /dev/null
+++ b/configs/tinyvit/tinyvit-11m_8xb256_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+ '../_base_/models/tinyvit/tinyvit-11m.py',
+]
diff --git a/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-384px.py b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..44e51b1930dd96c987dd4eab9dd77d0e068c801c
--- /dev/null
+++ b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-384px.py
@@ -0,0 +1,29 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+ '../_base_/models/tinyvit/tinyvit-21m.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ img_size=(384, 384),
+ window_size=[12, 12, 24, 12],
+ drop_path_rate=0.1,
+ ))
+
+# data settings
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='PackInputs'),
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+test_dataloader = val_dataloader
diff --git a/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-512px.py b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-512px.py
new file mode 100644
index 0000000000000000000000000000000000000000..05b47c6de94868a6df6ec95cd406095dfc80153e
--- /dev/null
+++ b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-512px.py
@@ -0,0 +1,28 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+ '../_base_/models/tinyvit/tinyvit-21m.py',
+]
+
+# model settings
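+# override the backbone input size and attention window sizes for
+# 512x512 inputs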
+model = dict(
+ backbone=dict(
+ img_size=(512, 512),
+ window_size=[16, 16, 32, 16],
+ drop_path_rate=0.1,
+ ))
+
+# data settings
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(512, 512),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='PackInputs'),
+]
+
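+# evaluate with a reduced batch size at the 512x512 resolution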
+val_dataloader = dict(batch_size=16, dataset=dict(pipeline=test_pipeline))
+
+test_dataloader = val_dataloader
diff --git a/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k.py b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..53885852757c6dce993addb6772b7d6e98219d81
--- /dev/null
+++ b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = [
+ './tinyvit-21m_8xb256_in1k.py',
+]
diff --git a/configs/tinyvit/tinyvit-21m_8xb256_in1k.py b/configs/tinyvit/tinyvit-21m_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..6c12019c9cf0babe49b24a21fa74fc66d33dda91
--- /dev/null
+++ b/configs/tinyvit/tinyvit-21m_8xb256_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+ '../_base_/models/tinyvit/tinyvit-21m.py',
+]
diff --git a/configs/tinyvit/tinyvit-5m-distill_8xb256_in1k.py b/configs/tinyvit/tinyvit-5m-distill_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0003c30ac46d2dbe2069733a17b039133b95ae8a
--- /dev/null
+++ b/configs/tinyvit/tinyvit-5m-distill_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = [
+ './tinyvit-5m_8xb256_in1k.py',
+]
diff --git a/configs/tinyvit/tinyvit-5m_8xb256_in1k.py b/configs/tinyvit/tinyvit-5m_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..262b5a469c4daa7ed135e466e872bb57e0f1f148
--- /dev/null
+++ b/configs/tinyvit/tinyvit-5m_8xb256_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+ '../_base_/models/tinyvit/tinyvit-5m.py',
+]
diff --git a/configs/tnt/README.md b/configs/tnt/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e86da0b4a8d31a09b6f41e99cff4c233e67a114a
--- /dev/null
+++ b/configs/tnt/README.md
@@ -0,0 +1,77 @@
+# Transformer in Transformer
+
+> [Transformer in Transformer](https://arxiv.org/abs/2103.00112)
+
+
+
+## Abstract
+
+Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4×4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('tnt-small-p16_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('tnt-small-p16_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
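+
+**List available models**
+
+The checkpoint names used above can be discovered programmatically (a minimal sketch; it assumes the `list_models` helper exported by the top-level `mmpretrain` package):
+
+```python
+from mmpretrain import list_models
+
+# print the names of all registered TNT models
+print(list_models('tnt*'))
+```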
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/tnt/tnt-s-p16_16xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/tnt/tnt-small-p16_3rdparty_in1k_20210903-c56ee7df.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------: | :------------------------------------------------------------------------------------: |
+| `tnt-small-p16_3rdparty_in1k`\* | From scratch | 23.76 | 3.36 | 81.52 | 95.73 | [config](tnt-s-p16_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tnt/tnt-small-p16_3rdparty_in1k_20210903-c56ee7df.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/contrastive/pytorch-image-models/blob/809271b0f3e5d9be4e11c0c5cec1dbba8b5e2c60/timm/models/tnt.py#L144). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@misc{han2021transformer,
+ title={Transformer in Transformer},
+ author={Kai Han and An Xiao and Enhua Wu and Jianyuan Guo and Chunjing Xu and Yunhe Wang},
+ year={2021},
+ eprint={2103.00112},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/tnt/metafile.yml b/configs/tnt/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..dcc2eddb5f479b987767802447cd46fa2a6383bb
--- /dev/null
+++ b/configs/tnt/metafile.yml
@@ -0,0 +1,29 @@
+Collections:
+ - Name: Transformer in Transformer
+ Metadata:
+ Training Data: ImageNet-1k
+ Paper:
+ URL: https://arxiv.org/abs/2103.00112
+ Title: "Transformer in Transformer"
+ README: configs/tnt/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/tnt.py#L203
+ Version: v0.15.0
+
+Models:
+ - Name: tnt-small-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 3360000000
+ Parameters: 23760000
+ In Collection: Transformer in Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.52
+ Top 5 Accuracy: 95.73
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tnt/tnt-small-p16_3rdparty_in1k_20210903-c56ee7df.pth
+ Config: configs/tnt/tnt-s-p16_16xb64_in1k.py
+ Converted From:
+ Weights: https://github.com/contrastive/pytorch-image-models/releases/download/TNT/tnt_s_patch16_224.pth.tar
+ Code: https://github.com/contrastive/pytorch-image-models/blob/809271b0f3e5d9be4e11c0c5cec1dbba8b5e2c60/timm/models/tnt.py#L144
diff --git a/configs/tnt/tnt-s-p16_16xb64_in1k.py b/configs/tnt/tnt-s-p16_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..af71232f831089a934d14beb4b187432661921ae
--- /dev/null
+++ b/configs/tnt/tnt-s-p16_16xb64_in1k.py
@@ -0,0 +1,56 @@
+# accuracy_top-1 : 81.52 accuracy_top-5 : 95.73
+_base_ = [
+ '../_base_/models/tnt_s_patch16_224.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
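+# mean/std of 127.5 map pixel values from [0, 255] to roughly [-1, 1]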
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=64)
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-3, weight_decay=0.05))
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', T_max=295, by_epoch=True, begin=5, end=300)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (16 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/twins/README.md b/configs/twins/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..9e97b7842d9ddb8ab12d13283fb3ed50ed172f70
--- /dev/null
+++ b/configs/twins/README.md
@@ -0,0 +1,80 @@
+# Twins
+
+> [Twins: Revisiting the Design of Spatial Attention in Vision Transformers](http://arxiv-export-lb.library.cornell.edu/abs/2104.13840)
+
+
+
+## Abstract
+
+Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is released at [this https URL](https://github.com/Meituan-AutoML/Twins).
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('twins-pcpvt-small_3rdparty_8xb128_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('twins-pcpvt-small_3rdparty_8xb128_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
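+
+**Batch inference**
+
+Several images can be classified in one call through the inferencer API (a sketch; it assumes the `ImageClassificationInferencer` class exported by `mmpretrain` and uses the bundled demo image as a placeholder input):
+
+```python
+from mmpretrain import ImageClassificationInferencer
+
+inferencer = ImageClassificationInferencer(
+    'twins-pcpvt-small_3rdparty_8xb128_in1k')
+# run inference on a list of images in mini-batches
+results = inferencer(['demo/bird.JPEG', 'demo/bird.JPEG'], batch_size=2)
+print(results[0]['pred_class'])
+```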
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/twins/twins-pcpvt-small_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-small_3rdparty_8xb128_in1k_20220126-ef23c132.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :-----------------------------------------------------------------: |
+| `twins-pcpvt-small_3rdparty_8xb128_in1k`\* | From scratch | 24.11 | 3.67 | 81.14 | 95.69 | [config](twins-pcpvt-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-small_3rdparty_8xb128_in1k_20220126-ef23c132.pth) |
+| `twins-pcpvt-base_3rdparty_8xb128_in1k`\* | From scratch | 43.83 | 6.45 | 82.66 | 96.26 | [config](twins-pcpvt-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-base_3rdparty_8xb128_in1k_20220126-f8c4b0d5.pth) |
+| `twins-pcpvt-large_3rdparty_16xb64_in1k`\* | From scratch | 60.99 | 9.51 | 83.09 | 96.59 | [config](twins-pcpvt-large_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-large_3rdparty_16xb64_in1k_20220126-c1ef8d80.pth) |
+| `twins-svt-small_3rdparty_8xb128_in1k`\* | From scratch | 24.06 | 2.82 | 81.77 | 95.57 | [config](twins-svt-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-small_3rdparty_8xb128_in1k_20220126-8fe5205b.pth) |
+| `twins-svt-base_8xb128_3rdparty_in1k`\* | From scratch | 56.07 | 8.35 | 83.13 | 96.29 | [config](twins-svt-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-base_3rdparty_8xb128_in1k_20220126-e31cc8e9.pth) |
+| `twins-svt-large_3rdparty_16xb64_in1k`\* | From scratch | 99.27 | 14.82 | 83.60 | 96.50 | [config](twins-svt-large_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-large_3rdparty_16xb64_in1k_20220126-4817645f.pth) |
+
+*Models with * are converted from [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{chu2021twins,
+ title={Twins: Revisiting spatial attention design in vision transformers},
+ author={Chu, Xiangxiang and Tian, Zhi and Wang, Yuqing and Zhang, Bo and Ren, Haibing and Wei, Xiaolin and Xia, Huaxia and Shen, Chunhua},
+ journal={arXiv preprint arXiv:2104.13840},
+  year={2021}
+}
+```
diff --git a/configs/twins/metafile.yml b/configs/twins/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..d0d8ff4a324b86865b711b48d769a1f8fdb9130c
--- /dev/null
+++ b/configs/twins/metafile.yml
@@ -0,0 +1,114 @@
+Collections:
+ - Name: Twins
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Global Subsampled Attention
+      - Locally Grouped Self-Attention
+ - Conditional Position Encoding
+ - Pyramid Vision Transformer
+ Paper:
+ URL: http://arxiv-export-lb.library.cornell.edu/abs/2104.13840
+ Title: "Twins: Revisiting the Design of Spatial Attention in Vision Transformers"
+ README: configs/twins/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/twins.py
+ Version: v0.20.1
+
+Models:
+ - Name: twins-pcpvt-small_3rdparty_8xb128_in1k
+ Metadata:
+ FLOPs: 3670000000 # 3.67G
+ Parameters: 24110000 # 24.11M
+ In Collection: Twins
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.14
+ Top 5 Accuracy: 95.69
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-small_3rdparty_8xb128_in1k_20220126-ef23c132.pth
+ Config: configs/twins/twins-pcpvt-small_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py
+ - Name: twins-pcpvt-base_3rdparty_8xb128_in1k
+ Metadata:
+ FLOPs: 6450000000 # 6.45G
+ Parameters: 43830000 # 43.83M
+ In Collection: Twins
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.66
+ Top 5 Accuracy: 96.26
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-base_3rdparty_8xb128_in1k_20220126-f8c4b0d5.pth
+ Config: configs/twins/twins-pcpvt-base_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py
+ - Name: twins-pcpvt-large_3rdparty_16xb64_in1k
+ Metadata:
+ FLOPs: 9510000000 # 9.51G
+ Parameters: 60990000 # 60.99M
+ In Collection: Twins
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.09
+ Top 5 Accuracy: 96.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-large_3rdparty_16xb64_in1k_20220126-c1ef8d80.pth
+ Config: configs/twins/twins-pcpvt-large_16xb64_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py
+ - Name: twins-svt-small_3rdparty_8xb128_in1k
+ Metadata:
+ FLOPs: 2820000000 # 2.82G
+ Parameters: 24060000 # 24.06M
+ In Collection: Twins
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.77
+ Top 5 Accuracy: 95.57
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-small_3rdparty_8xb128_in1k_20220126-8fe5205b.pth
+ Config: configs/twins/twins-svt-small_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py
+ - Name: twins-svt-base_8xb128_3rdparty_in1k
+ Metadata:
+ FLOPs: 8350000000 # 8.35G
+ Parameters: 56070000 # 56.07M
+ In Collection: Twins
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.13
+ Top 5 Accuracy: 96.29
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-base_3rdparty_8xb128_in1k_20220126-e31cc8e9.pth
+ Config: configs/twins/twins-svt-base_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py
+ - Name: twins-svt-large_3rdparty_16xb64_in1k
+ Metadata:
+ FLOPs: 14820000000 # 14.82G
+ Parameters: 99270000 # 99.27M
+ In Collection: Twins
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.60
+ Top 5 Accuracy: 96.50
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-large_3rdparty_16xb64_in1k_20220126-4817645f.pth
+ Config: configs/twins/twins-svt-large_16xb64_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py
diff --git a/configs/twins/twins-pcpvt-base_8xb128_in1k.py b/configs/twins/twins-pcpvt-base_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3ac5d2adf15e4c71af8cff09a59acaa9d863f9a7
--- /dev/null
+++ b/configs/twins/twins-pcpvt-base_8xb128_in1k.py
@@ -0,0 +1,41 @@
+_base_ = [
+ '../_base_/models/twins_pcpvt_base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+        lr=5e-4 * 128 * 8 / 512,  # base lr 5e-4 at batch 512, scaled to 128x8
+ weight_decay=0.05,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
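+    # discard the paramwise settings inherited from the base schedule and
+    # exempt normalization layers and biases from weight decay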
+    paramwise_cfg=dict(
+        _delete_=True, norm_decay_mult=0.0, bias_decay_mult=0.0),
+ clip_grad=dict(max_norm=5.0),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=5,
+ end=300)
+]
diff --git a/configs/twins/twins-pcpvt-large_16xb64_in1k.py b/configs/twins/twins-pcpvt-large_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0acfd7528b5c17ece73586df3ce7dc850ea5a64a
--- /dev/null
+++ b/configs/twins/twins-pcpvt-large_16xb64_in1k.py
@@ -0,0 +1,7 @@
+_base_ = ['twins-pcpvt-base_8xb128_in1k.py']
+
+# model settings
+model = dict(backbone=dict(arch='large'), head=dict(in_channels=512))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
diff --git a/configs/twins/twins-pcpvt-small_8xb128_in1k.py b/configs/twins/twins-pcpvt-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9fe763b77754bf249030d48459302e532900a1a3
--- /dev/null
+++ b/configs/twins/twins-pcpvt-small_8xb128_in1k.py
@@ -0,0 +1,4 @@
+_base_ = ['twins-pcpvt-base_8xb128_in1k.py']
+
+# model settings
+model = dict(backbone=dict(arch='small'), head=dict(in_channels=512))
diff --git a/configs/twins/twins-svt-base_8xb128_in1k.py b/configs/twins/twins-svt-base_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1d24f63b074afe59574d04e40f8379ec6c386baa
--- /dev/null
+++ b/configs/twins/twins-svt-base_8xb128_in1k.py
@@ -0,0 +1,41 @@
+_base_ = [
+ '../_base_/models/twins_svt_base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+        lr=5e-4 * 128 * 8 / 512,  # base lr 5e-4 at batch 512, scaled to 128x8
+ weight_decay=0.05,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+    paramwise_cfg=dict(
+        _delete_=True, norm_decay_mult=0.0, bias_decay_mult=0.0),
+ clip_grad=dict(max_norm=5.0),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=5,
+ end=300)
+]
diff --git a/configs/twins/twins-svt-large_16xb64_in1k.py b/configs/twins/twins-svt-large_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e8a1eba894e5f831376ad7c5871434db438db59b
--- /dev/null
+++ b/configs/twins/twins-svt-large_16xb64_in1k.py
@@ -0,0 +1,7 @@
+_base_ = ['twins-svt-base_8xb128_in1k.py']
+
+# model settings
+model = dict(backbone=dict(arch='large'), head=dict(in_channels=1024))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
diff --git a/configs/twins/twins-svt-small_8xb128_in1k.py b/configs/twins/twins-svt-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2ffe267b56e921abcdcc40c833bba42e9952a4d4
--- /dev/null
+++ b/configs/twins/twins-svt-small_8xb128_in1k.py
@@ -0,0 +1,4 @@
+_base_ = ['twins-svt-base_8xb128_in1k.py']
+
+# model settings
+model = dict(backbone=dict(arch='small'), head=dict(in_channels=512))
diff --git a/configs/van/README.md b/configs/van/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7e548b6b8003169602ea6a205c2c305b8808ed39
--- /dev/null
+++ b/configs/van/README.md
@@ -0,0 +1,78 @@
+# Visual-Attention-Network
+
+> [Visual Attention Network](https://arxiv.org/abs/2202.09741)
+
+
+
+## Abstract
+
+While originally designed for natural language processing (NLP) tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention while avoiding the above issues. We further introduce a novel neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple and efficient, VAN outperforms the state-of-the-art vision transformers and convolutional neural networks with a large margin in extensive experiments, including image classification, object detection, semantic segmentation, instance segmentation, etc.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('van-tiny_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('van-tiny_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/van/van-tiny_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/van/van-tiny_8xb128_in1k_20220501-385941af.pth
+```
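+
+Multi-GPU evaluation can go through the distributed launcher instead (a sketch; it assumes the standard `tools/dist_test.sh` script and 8 available GPUs):
+
+```shell
+bash tools/dist_test.sh configs/van/van-tiny_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/van/van-tiny_8xb128_in1k_20220501-385941af.pth 8
+```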
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------: | :----------------------------------------------------------------------------------------: |
+| `van-tiny_3rdparty_in1k`\* | From scratch | 4.11 | 0.88 | 75.41 | 93.02 | [config](van-tiny_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/van/van-tiny_8xb128_in1k_20220501-385941af.pth) |
+| `van-small_3rdparty_in1k`\* | From scratch | 13.86 | 2.52 | 81.01 | 95.63 | [config](van-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/van/van-small_8xb128_in1k_20220501-17bc91aa.pth) |
+| `van-base_3rdparty_in1k`\* | From scratch | 26.58 | 5.03 | 82.80 | 96.21 | [config](van-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/van/van-base_8xb128_in1k_20220501-6a4cc31b.pth) |
+| `van-large_3rdparty_in1k`\* | From scratch | 44.77 | 8.99 | 83.86 | 96.73 | [config](van-large_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/van/van-large_8xb128_in1k_20220501-f212ba21.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/Visual-Attention-Network/VAN-Classification). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{guo2022visual,
+ title={Visual Attention Network},
+ author={Guo, Meng-Hao and Lu, Cheng-Ze and Liu, Zheng-Ning and Cheng, Ming-Ming and Hu, Shi-Min},
+ journal={arXiv preprint arXiv:2202.09741},
+ year={2022}
+}
+```
diff --git a/configs/van/metafile.yml b/configs/van/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..db5a6e6443c13a1eb9dc669923d8c0902e89ee7a
--- /dev/null
+++ b/configs/van/metafile.yml
@@ -0,0 +1,82 @@
+Collections:
+ - Name: Visual-Attention-Network
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ - Weight Decay
+ Architecture:
+ - Visual Attention Network
+ Paper:
+ URL: https://arxiv.org/abs/2202.09741
+ Title: "Visual Attention Network"
+ README: configs/van/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.23.0/mmcls/models/backbones/van.py
+ Version: v0.23.0
+
+Models:
+ - Name: van-tiny_3rdparty_in1k
+ Metadata:
+ Parameters: 4110000 # 4.11M
+ FLOPs: 880000000 # 0.88G
+ In Collection: Visual-Attention-Network
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 75.41
+ Top 5 Accuracy: 93.02
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/van/van-tiny_8xb128_in1k_20220501-385941af.pth
+ Config: configs/van/van-tiny_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/Visual-Attention-Network/VAN-Classification
+ Weights: https://cloud.tsinghua.edu.cn/f/aada2242a16245d6a561/?dl=1
+ - Name: van-small_3rdparty_in1k
+ Metadata:
+ Parameters: 13860000 # 13.86M
+ FLOPs: 2520000000 # 2.52G
+ In Collection: Visual-Attention-Network
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.01
+ Top 5 Accuracy: 95.63
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/van/van-small_8xb128_in1k_20220501-17bc91aa.pth
+ Config: configs/van/van-small_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/Visual-Attention-Network/VAN-Classification
+ Weights: https://cloud.tsinghua.edu.cn/f/dd3eb73692f74a2499c9/?dl=1
+ - Name: van-base_3rdparty_in1k
+ Metadata:
+ Parameters: 26580000 # 26.58M
+ FLOPs: 5030000000 # 5.03G
+ In Collection: Visual-Attention-Network
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.80
+ Top 5 Accuracy: 96.21
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/van/van-base_8xb128_in1k_20220501-6a4cc31b.pth
+ Config: configs/van/van-base_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/Visual-Attention-Network/VAN-Classification
+ Weights: https://cloud.tsinghua.edu.cn/f/58e7acceaf334ecdba89/?dl=1
+ - Name: van-large_3rdparty_in1k
+ Metadata:
+ Parameters: 44770000 # 44.77 M
+ FLOPs: 8990000000 # 8.99G
+ In Collection: Visual-Attention-Network
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.86
+ Top 5 Accuracy: 96.73
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/van/van-large_8xb128_in1k_20220501-f212ba21.pth
+ Config: configs/van/van-large_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/Visual-Attention-Network/VAN-Classification
+ Weights: https://cloud.tsinghua.edu.cn/f/0201745f6920482490a0/?dl=1
diff --git a/configs/van/van-base_8xb128_in1k.py b/configs/van/van-base_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..47082b748554eea9dfc467f63a5644294131fd14
--- /dev/null
+++ b/configs/van/van-base_8xb128_in1k.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/models/van/van_base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
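+# pipeline transforms run before the BGR-to-RGB conversion in the data
+# preprocessor, so their fill values are given in BGR channel order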
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128)
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/van/van-large_8xb128_in1k.py b/configs/van/van-large_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b16567726222306eff4a28ef76361922ecf28970
--- /dev/null
+++ b/configs/van/van-large_8xb128_in1k.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/models/van/van_large.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128)
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/van/van-small_8xb128_in1k.py b/configs/van/van-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..bbbbbdf4c8b7441a19c00c44f012478b1021335a
--- /dev/null
+++ b/configs/van/van-small_8xb128_in1k.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/models/van/van_small.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128)
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/van/van-tiny_8xb128_in1k.py b/configs/van/van-tiny_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2ac62dab083c5c42dfd532f9191f01c74fcc9408
--- /dev/null
+++ b/configs/van/van-tiny_8xb128_in1k.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/models/van/van_tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128)
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/vgg/README.md b/configs/vgg/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7af69ce6b87d1ce989881fa17bf5c6cacc3748be
--- /dev/null
+++ b/configs/vgg/README.md
@@ -0,0 +1,86 @@
+# VGG
+
+> [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/abs/1409.1556)
+
+
+
+## Abstract
+
+In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vgg11_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vgg11_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/vgg/vgg11_8xb32_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/vgg/vgg11_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.pth
+```
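+
+Multi-GPU training that matches the `8xb32` setting in the config names can be launched with the distributed script (a sketch; it assumes the standard `tools/dist_train.sh` launcher and 8 GPUs on one node):
+
+```shell
+bash tools/dist_train.sh configs/vgg/vgg11_8xb32_in1k.py 8
+```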
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------: | :--------------------------------------------------------------------------------------------------: |
+| `vgg11_8xb32_in1k` | From scratch | 132.86 | 7.63 | 68.75 | 88.87 | [config](vgg11_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.json) |
+| `vgg13_8xb32_in1k` | From scratch | 133.05 | 11.34 | 70.02 | 89.46 | [config](vgg13_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_batch256_imagenet_20210208-4d1d6080.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_batch256_imagenet_20210208-4d1d6080.json) |
+| `vgg16_8xb32_in1k` | From scratch | 138.36 | 15.50 | 71.62 | 90.49 | [config](vgg16_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_batch256_imagenet_20210208-db26f1a5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_batch256_imagenet_20210208-db26f1a5.json) |
+| `vgg19_8xb32_in1k` | From scratch | 143.67 | 19.67 | 72.41 | 90.80 | [config](vgg19_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_batch256_imagenet_20210208-e6920e4a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_batch256_imagenet_20210208-e6920e4a.json) |
+| `vgg11bn_8xb32_in1k` | From scratch | 132.87 | 7.64 | 70.67 | 90.16 | [config](vgg11bn_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_bn_batch256_imagenet_20210207-f244902c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_bn_batch256_imagenet_20210207-f244902c.json) |
+| `vgg13bn_8xb32_in1k` | From scratch | 133.05 | 11.36 | 72.12 | 90.66 | [config](vgg13bn_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_bn_batch256_imagenet_20210207-1a8b7864.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_bn_batch256_imagenet_20210207-1a8b7864.json) |
+| `vgg16bn_8xb32_in1k` | From scratch | 138.37 | 15.53 | 73.74 | 91.66 | [config](vgg16bn_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_bn_batch256_imagenet_20210208-7e55cd29.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_bn_batch256_imagenet_20210208-7e55cd29.json) |
+| `vgg19bn_8xb32_in1k` | From scratch | 143.68 | 19.70 | 74.68 | 92.27 | [config](vgg19bn_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_bn_batch256_imagenet_20210208-da620c4f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_bn_batch256_imagenet_20210208-da620c4f.json) |
+
+## Citation
+
+```bibtex
+@article{simonyan2014very,
+ title={Very deep convolutional networks for large-scale image recognition},
+ author={Simonyan, Karen and Zisserman, Andrew},
+ journal={arXiv preprint arXiv:1409.1556},
+ year={2014}
+}
+```
diff --git a/configs/vgg/metafile.yml b/configs/vgg/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..ce3af191a746878f7d9b6febf67cc6c96a5fa8c1
--- /dev/null
+++ b/configs/vgg/metafile.yml
@@ -0,0 +1,125 @@
+Collections:
+ - Name: VGG
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x Xp GPUs
+ Epochs: 100
+ Batch Size: 256
+ Architecture:
+ - VGG
+ Paper:
+ URL: https://arxiv.org/abs/1409.1556
+ Title: "Very Deep Convolutional Networks for Large-Scale Image Recognition"
+ README: configs/vgg/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/vgg.py#L39
+ Version: v0.15.0
+
+Models:
+ - Name: vgg11_8xb32_in1k
+ Metadata:
+ FLOPs: 7630000000
+ Parameters: 132860000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 68.75
+ Top 5 Accuracy: 88.87
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.pth
+ Config: configs/vgg/vgg11_8xb32_in1k.py
+ - Name: vgg13_8xb32_in1k
+ Metadata:
+ FLOPs: 11340000000
+ Parameters: 133050000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 70.02
+ Top 5 Accuracy: 89.46
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_batch256_imagenet_20210208-4d1d6080.pth
+ Config: configs/vgg/vgg13_8xb32_in1k.py
+ - Name: vgg16_8xb32_in1k
+ Metadata:
+ FLOPs: 15500000000
+ Parameters: 138360000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 71.62
+ Top 5 Accuracy: 90.49
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_batch256_imagenet_20210208-db26f1a5.pth
+ Config: configs/vgg/vgg16_8xb32_in1k.py
+ - Name: vgg19_8xb32_in1k
+ Metadata:
+ FLOPs: 19670000000
+ Parameters: 143670000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 72.41
+ Top 5 Accuracy: 90.8
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_batch256_imagenet_20210208-e6920e4a.pth
+ Config: configs/vgg/vgg19_8xb32_in1k.py
+ - Name: vgg11bn_8xb32_in1k
+ Metadata:
+ FLOPs: 7640000000
+ Parameters: 132870000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 70.67
+ Top 5 Accuracy: 90.16
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_bn_batch256_imagenet_20210207-f244902c.pth
+ Config: configs/vgg/vgg11bn_8xb32_in1k.py
+ - Name: vgg13bn_8xb32_in1k
+ Metadata:
+ FLOPs: 11360000000
+ Parameters: 133050000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 72.12
+ Top 5 Accuracy: 90.66
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_bn_batch256_imagenet_20210207-1a8b7864.pth
+ Config: configs/vgg/vgg13bn_8xb32_in1k.py
+ - Name: vgg16bn_8xb32_in1k
+ Metadata:
+ FLOPs: 15530000000
+ Parameters: 138370000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 73.74
+ Top 5 Accuracy: 91.66
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_bn_batch256_imagenet_20210208-7e55cd29.pth
+ Config: configs/vgg/vgg16bn_8xb32_in1k.py
+ - Name: vgg19bn_8xb32_in1k
+ Metadata:
+ FLOPs: 19700000000
+ Parameters: 143680000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.68
+ Top 5 Accuracy: 92.27
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_bn_batch256_imagenet_20210208-da620c4f.pth
+ Config: configs/vgg/vgg19bn_8xb32_in1k.py
diff --git a/configs/vgg/vgg11_8xb32_in1k.py b/configs/vgg/vgg11_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..616233c418fdeaa5d08db75b290f3438ec96b13c
--- /dev/null
+++ b/configs/vgg/vgg11_8xb32_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/vgg11.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=0.01))
diff --git a/configs/vgg/vgg11bn_8xb32_in1k.py b/configs/vgg/vgg11bn_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..22f55ef0851ee4728caad271cfdaf02fb5c4afed
--- /dev/null
+++ b/configs/vgg/vgg11bn_8xb32_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vgg11bn.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vgg/vgg13_8xb32_in1k.py b/configs/vgg/vgg13_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ec1c98fb997568754868670a0f9d37233e6ca57d
--- /dev/null
+++ b/configs/vgg/vgg13_8xb32_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/vgg13.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=0.01))
diff --git a/configs/vgg/vgg13bn_8xb32_in1k.py b/configs/vgg/vgg13bn_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3cb3592b09e06e1b902c6d1fcca2cb03bcb7f82c
--- /dev/null
+++ b/configs/vgg/vgg13bn_8xb32_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vgg13bn.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vgg/vgg16_8xb16_voc.py b/configs/vgg/vgg16_8xb16_voc.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d9e347bf533f36eb165dd06d0faf20ccbaba917
--- /dev/null
+++ b/configs/vgg/vgg16_8xb16_voc.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/datasets/voc_bs16.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+
+# initialize the backbone from an ImageNet-1k pre-trained VGG-16 checkpoint
+pretrained = 'https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_batch256_imagenet_20210208-db26f1a5.pth' # noqa
+
+# use a multi-label head since each VOC image may contain multiple classes
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VGG',
+ depth=16,
+ num_classes=20,
+ init_cfg=dict(
+ type='Pretrained', checkpoint=pretrained, prefix='backbone')),
+ neck=None,
+ head=dict(
+ type='MultiLabelClsHead',
+ loss=dict(type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0)))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0),
+    # train the final classifier layer with a 10x learning rate.
+ paramwise_cfg=dict(custom_keys={'.backbone.classifier': dict(lr_mult=10)}),
+)
+
+# learning policy
+param_scheduler = dict(type='StepLR', by_epoch=True, step_size=20, gamma=0.1)
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=40, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (16 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/vgg/vgg16_8xb32_in1k.py b/configs/vgg/vgg16_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a291da2813f011323f7ba19724dc92d87b935f80
--- /dev/null
+++ b/configs/vgg/vgg16_8xb32_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/vgg16.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=0.01))
diff --git a/configs/vgg/vgg16bn_8xb32_in1k.py b/configs/vgg/vgg16bn_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f6bbb81b86b279bbf84d7b877ef3bc370dedbf4e
--- /dev/null
+++ b/configs/vgg/vgg16bn_8xb32_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vgg16bn.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vgg/vgg19_8xb32_in1k.py b/configs/vgg/vgg19_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..88cd24c1dd9cb28dc3c91e4403b241c441dfbe03
--- /dev/null
+++ b/configs/vgg/vgg19_8xb32_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/vgg19.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=0.01))
diff --git a/configs/vgg/vgg19bn_8xb32_in1k.py b/configs/vgg/vgg19bn_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4b4f34aba0ad5f665b86a8173af9e4436546af23
--- /dev/null
+++ b/configs/vgg/vgg19bn_8xb32_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vgg19bn.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vig/README.md b/configs/vig/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..624e387ac3799f599cbd886e9053cfa1d2a2de95
--- /dev/null
+++ b/configs/vig/README.md
@@ -0,0 +1,81 @@
+# VIG
+
+> [Vision GNN: An Image is Worth Graph of Nodes](https://arxiv.org/abs/2206.00272)
+
+
+
+## Abstract
+
+Network architecture plays a key role in the deep learning-based computer vision system. The widely-used convolutional neural network and transformer treat the image as a grid or sequence structure, which is not flexible to capture irregular and complex objects. In this paper, we propose to represent the image as a graph structure and introduce a new Vision GNN (ViG) architecture to extract graph-level feature for visual tasks. We first split the image to a number of patches which are viewed as nodes, and construct a graph by connecting the nearest neighbors. Based on the graph representation of images, we build our ViG model to transform and exchange information among all the nodes. ViG consists of two basic modules: Grapher module with graph convolution for aggregating and updating graph information, and FFN module with two linear layers for node feature transformation. Both isotropic and pyramid architectures of ViG are built with different model sizes. Extensive experiments on image recognition and object detection tasks demonstrate the superiority of our ViG architecture. We hope this pioneering study of GNN on general visual tasks will provide useful inspiration and experience for future research.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vig-tiny_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vig-tiny_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/vig/vig-tiny_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/vig/vig-tiny_3rdparty_in1k_20230117-6414c684.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :------------------------------------------------------------------------------------: |
+| `vig-tiny_3rdparty_in1k`\* | From scratch | 7.18 | 1.31 | 74.40 | 92.34 | [config](vig-tiny_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/vig-tiny_3rdparty_in1k_20230117-6414c684.pth) |
+| `vig-small_3rdparty_in1k`\* | From scratch | 22.75 | 4.54 | 80.61 | 95.28 | [config](vig-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/vig-small_3rdparty_in1k_20230117-5338bf3b.pth) |
+| `vig-base_3rdparty_in1k`\* | From scratch | 20.68 | 17.68 | 82.62 | 96.04 | [config](vig-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/vig-base_3rdparty_in1k_20230117-92f6f12f.pth) |
+| `pvig-tiny_3rdparty_in1k`\* | From scratch | 9.46 | 1.71 | 78.38 | 94.38 | [config](pvig-tiny_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/pvig-tiny_3rdparty_in1k_20230117-eb77347d.pth) |
+| `pvig-small_3rdparty_in1k`\* | From scratch | 29.02 | 4.57 | 82.00 | 95.97 | [config](pvig-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/pvig-small_3rdparty_in1k_20230117-9433dc96.pth) |
+| `pvig-medium_3rdparty_in1k`\* | From scratch | 51.68 | 8.89 | 83.12 | 96.35 | [config](pvig-medium_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/pvig-medium_3rdparty_in1k_20230117-21057a6d.pth) |
+| `pvig-base_3rdparty_in1k`\* | From scratch | 95.21 | 16.86 | 83.59 | 96.52 | [config](pvig-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/pvig-base_3rdparty_in1k_20230117-dbab3c85.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{han2022vig,
+ title={Vision GNN: An Image is Worth Graph of Nodes},
+ author={Kai Han and Yunhe Wang and Jianyuan Guo and Yehui Tang and Enhua Wu},
+ booktitle={NeurIPS},
+ year={2022}
+}
+```
diff --git a/configs/vig/metafile.yml b/configs/vig/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..52bd18baf1623bf1f12a95d93c331749847a1339
--- /dev/null
+++ b/configs/vig/metafile.yml
@@ -0,0 +1,134 @@
+Collections:
+ - Name: VIG
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Vision GNN
+ Paper:
+ Title: 'Vision GNN: An Image is Worth Graph of Nodes'
+ URL: https://arxiv.org/abs/2206.00272
+ README: configs/vig/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: vig-tiny_3rdparty_in1k
+ Metadata:
+ FLOPs: 1309000000
+ Parameters: 7185000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.40
+ Top 5 Accuracy: 92.34
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/vig-tiny_3rdparty_in1k_20230117-6414c684.pth
+ Config: configs/vig/vig-tiny_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/vig/vig_ti_74.5.pth
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
+ - Name: vig-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 4535000000
+ Parameters: 22748000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.61
+ Top 5 Accuracy: 95.28
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/vig-small_3rdparty_in1k_20230117-5338bf3b.pth
+ Config: configs/vig/vig-small_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/vig/vig_s_80.6.pth
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
+ - Name: vig-base_3rdparty_in1k
+ Metadata:
+ FLOPs: 17681000000
+ Parameters: 20685000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.62
+ Top 5 Accuracy: 96.04
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/vig-base_3rdparty_in1k_20230117-92f6f12f.pth
+ Config: configs/vig/vig-base_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/vig/vig_b_82.6.pth
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
+ - Name: pvig-tiny_3rdparty_in1k
+ Metadata:
+ FLOPs: 1714000000
+ Parameters: 9458000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.38
+ Top 5 Accuracy: 94.38
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/pvig-tiny_3rdparty_in1k_20230117-eb77347d.pth
+ Config: configs/vig/pvig-tiny_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/pyramid-vig/pvig_ti_78.5.pth.tar
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
+ - Name: pvig-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 4572000000
+ Parameters: 29024000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.00
+ Top 5 Accuracy: 95.97
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/pvig-small_3rdparty_in1k_20230117-9433dc96.pth
+ Config: configs/vig/pvig-small_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/pyramid-vig/pvig_s_82.1.pth.tar
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
+ - Name: pvig-medium_3rdparty_in1k
+ Metadata:
+ FLOPs: 8886000000
+ Parameters: 51682000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.12
+ Top 5 Accuracy: 96.35
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/pvig-medium_3rdparty_in1k_20230117-21057a6d.pth
+ Config: configs/vig/pvig-medium_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/pyramid-vig/pvig_m_83.1.pth.tar
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
+ - Name: pvig-base_3rdparty_in1k
+ Metadata:
+ FLOPs: 16861000000
+ Parameters: 95213000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.59
+ Top 5 Accuracy: 96.52
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/pvig-base_3rdparty_in1k_20230117-dbab3c85.pth
+ Config: configs/vig/pvig-base_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/pyramid-vig/pvig_b_83.66.pth.tar
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
diff --git a/configs/vig/pvig-base_8xb128_in1k.py b/configs/vig/pvig-base_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1d66359c6c78068e48e0466fede86f11e14e9a91
--- /dev/null
+++ b/configs/vig/pvig-base_8xb128_in1k.py
@@ -0,0 +1,22 @@
+_base_ = [
+ '../_base_/models/vig/pyramid_vig_base.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=235,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/vig/pvig-medium_8xb128_in1k.py b/configs/vig/pvig-medium_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..75c25a2d89b0b8fce8d816d0129afeaf63d6a5e2
--- /dev/null
+++ b/configs/vig/pvig-medium_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vig/pyramid_vig_medium.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vig/pvig-small_8xb128_in1k.py b/configs/vig/pvig-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..755b3319d313f02ce9f1c2f2a943ddd934f7e49b
--- /dev/null
+++ b/configs/vig/pvig-small_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vig/pyramid_vig_small.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vig/pvig-tiny_8xb128_in1k.py b/configs/vig/pvig-tiny_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7a885559c597962201bed20249f8b688589a7788
--- /dev/null
+++ b/configs/vig/pvig-tiny_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vig/pyramid_vig_tiny.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vig/vig-base_8xb128_in1k.py b/configs/vig/vig-base_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cb8b55e3e841659f65e975947a9859361e34aa28
--- /dev/null
+++ b/configs/vig/vig-base_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vig/vig_base.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vig/vig-small_8xb128_in1k.py b/configs/vig/vig-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..41508b2894d0849cfc92dd2340c71bebdf06f591
--- /dev/null
+++ b/configs/vig/vig-small_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vig/vig_small.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vig/vig-tiny_8xb128_in1k.py b/configs/vig/vig-tiny_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..80b1693ad5baecd57d450ae33806e80ddce0f55e
--- /dev/null
+++ b/configs/vig/vig-tiny_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vig/vig_tiny.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vision_transformer/README.md b/configs/vision_transformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..66bd3f529dd85062323c585b38660ab414362250
--- /dev/null
+++ b/configs/vision_transformer/README.md
@@ -0,0 +1,101 @@
+# Vision Transformer
+
+> [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)
+
+
+
+## Introduction
+
+**Vision Transformer**, known as **ViT**, succeeded in using a full transformer to outperform previous works based on convolutional networks in the vision field. ViT splits an image into patches to feed the multi-head attention, concatenates a learnable class token for the final prediction, and adds learnable position embeddings to encode the relative positions between patches. Based on these three attention-related techniques, ViT provides a brand-new pattern for building basic architectures in the vision field.
+
+The strategy works even better when coupled with pre-training on large datasets. Thanks to its simplicity and effectiveness, many follow-up works in the classification field originate from ViT, and ViT-based methods still play an important role in the recent multi-modality field.
+
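+
+As a rough illustration of the three ingredients mentioned above (patch embedding, class token and position embedding), here is a minimal sketch in plain PyTorch. The tensor sizes follow the ViT-Base/16 setting; the snippet is illustrative only and is not the MMPreTrain implementation.
+
+```python
+import torch
+import torch.nn as nn
+
+img = torch.rand(1, 3, 224, 224)  # a single input image
+patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # 16x16 patches -> tokens
+cls_token = nn.Parameter(torch.zeros(1, 1, 768))  # learnable class token
+pos_embed = nn.Parameter(torch.zeros(1, 197, 768))  # 196 patch tokens + 1 class token
+
+tokens = patch_embed(img).flatten(2).transpose(1, 2)  # (1, 196, 768)
+tokens = torch.cat([cls_token, tokens], dim=1)  # prepend the class token
+tokens = tokens + pos_embed  # add position embeddings
+print(tokens.shape)  # torch.Size([1, 197, 768]), fed into the transformer encoder
+```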
+
+
+
+
+## Abstract
+
+
+
+
+
+
+While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-base-p32_in21k-pre_3rdparty_in1k-384px', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vit-base-p32_in21k-pre_3rdparty_in1k-384px', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-9cea8599.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------: | :----------------------------------------------------------: |
+| `vit-base-p32_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 88.30 | 13.06 | 84.01 | 97.08 | [config](vit-base-p32_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-9cea8599.pth) |
+| `vit-base-p16_32xb128-mae_in1k` | From scratch | 86.57 | 17.58 | 82.37 | 96.15 | [config](vit-base-p16_32xb128-mae_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/vit-base-p16_pt-32xb128-mae_in1k_20220623-4c544545.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vit/vit-base-p16_pt-32xb128-mae_in1k_20220623-4c544545.log) |
+| `vit-base-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 86.86 | 55.54 | 85.43 | 97.77 | [config](vit-base-p16_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth) |
+| `vit-large-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 304.72 | 191.21 | 85.63 | 97.63 | [config](vit-large-p16_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-large-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-b20ba619.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{
+ dosovitskiy2021an,
+ title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
+ author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
+ booktitle={International Conference on Learning Representations},
+ year={2021},
+ url={https://openreview.net/forum?id=YicbFdNTTy}
+}
+```
diff --git a/configs/vision_transformer/metafile.yml b/configs/vision_transformer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..891c413ab6c5b579eb5d404b7b7e7d01fe94b8d8
--- /dev/null
+++ b/configs/vision_transformer/metafile.yml
@@ -0,0 +1,95 @@
+Collections:
+ - Name: Vision Transformer
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ Title: 'An Image is Worth 16x16 Words: Transformers for Image Recognition at
+ Scale'
+ URL: https://arxiv.org/abs/2010.11929
+ README: configs/vision_transformer/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.17.0/mmcls/models/backbones/vision_transformer.py
+ Version: v0.17.0
+
+Models:
+ - Name: vit-base-p32_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 13056716544
+ Parameters: 88297192
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: Vision Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.01
+ Top 5 Accuracy: 97.08
+ Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-9cea8599.pth
+ Config: configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py
+ Converted From:
+ Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/B_32-i21k-300ep-lr_0.001-aug_light1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz
+ Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208
+ - Name: vit-base-p16_32xb128-mae_in1k
+ Metadata:
+ FLOPs: 17581972224
+ Parameters: 86567656
+ Training Data:
+ - ImageNet-1k
+ In Collection: Vision Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 82.37
+ Top 5 Accuracy: 96.15
+ Weights: https://download.openmmlab.com/mmclassification/v0/vit/vit-base-p16_pt-32xb128-mae_in1k_20220623-4c544545.pth
+ Config: configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py
+ - Name: vit-base-p16_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 55538974464
+ Parameters: 86859496
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: Vision Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.43
+ Top 5 Accuracy: 97.77
+ Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth
+ Config: configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py
+ Converted From:
+ Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_384.npz
+ Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208
+ - Name: vit-large-p16_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 191210034176
+ Parameters: 304715752
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: Vision Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.63
+ Top 5 Accuracy: 97.63
+ Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-large-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-b20ba619.pth
+ Config: configs/vision_transformer/vit-large-p16_64xb64_in1k-384px.py
+ Converted From:
+ Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/L_16-i21k-300ep-lr_0.001-aug_strong1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz
+ Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208
diff --git a/configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py b/configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a46bbb21a99b34f792f277759b4dccb75c88b2ed
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py
@@ -0,0 +1,58 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+ lr=1e-4 * 4096 / 256,
+ weight_decay=0.3,
+ eps=1e-8,
+ betas=(0.9, 0.95)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# runtime settings
+custom_hooks = [dict(type='EMAHook', momentum=1e-4)]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/vision_transformer/vit-base-p16_4xb544-ipu_in1k.py b/configs/vision_transformer/vit-base-p16_4xb544-ipu_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d378b3b265b30b7f3e492dcf22527fed5cd9beb4
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p16_4xb544-ipu_in1k.py
@@ -0,0 +1,114 @@
+_base_ = [
+ '../_base_/models/vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
+ '../_base_/default_runtime.py'
+]
+
+# specific to vit pretrain
+paramwise_cfg = dict(custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+})
+
+pretrained = 'https://download.openmmlab.com/mmclassification/v0/vit/pretrain/vit-base-p16_3rdparty_pt-64xb64_in1k-224_20210928-02284250.pth' # noqa
+
+model = dict(
+ head=dict(
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0, _delete_=True), ),
+ backbone=dict(
+ img_size=224,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint=pretrained,
+ _delete_=True,
+ prefix='backbone')))
+
+img_norm_cfg = dict(
+ mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5], to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='ToTensor', keys=['gt_label']),
+ dict(type='ToHalf', keys=['img']),
+ dict(type='Collect', keys=['img', 'gt_label'])
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=(224, -1), keep_ratio=True, backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='ToHalf', keys=['img']),
+ dict(type='Collect', keys=['img'])
+]
+
+# change batch size
+data = dict(
+ samples_per_gpu=17,
+ workers_per_gpu=16,
+ drop_last=True,
+ train=dict(pipeline=train_pipeline),
+ train_dataloader=dict(mode='async'),
+ val=dict(pipeline=test_pipeline, ),
+ val_dataloader=dict(samples_per_gpu=4, workers_per_gpu=1),
+ test=dict(pipeline=test_pipeline),
+ test_dataloader=dict(samples_per_gpu=4, workers_per_gpu=1))
+
+# optimizer
+optimizer = dict(
+ type='SGD',
+ lr=0.08,
+ weight_decay=1e-5,
+ momentum=0.9,
+ paramwise_cfg=paramwise_cfg,
+)
+
+# learning policy
+param_scheduler = [
+ dict(type='LinearLR', start_factor=0.02, by_epoch=False, begin=0, end=800),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=4200,
+ by_epoch=False,
+ begin=800,
+ end=5000)
+]
+
+# ipu cfg
+# model partition config
+ipu_model_cfg = dict(
+ train_split_edges=[
+ dict(layer_to_call='backbone.patch_embed', ipu_id=0),
+ dict(layer_to_call='backbone.layers.3', ipu_id=1),
+ dict(layer_to_call='backbone.layers.6', ipu_id=2),
+ dict(layer_to_call='backbone.layers.9', ipu_id=3)
+ ],
+ train_ckpt_nodes=['backbone.layers.{}'.format(i) for i in range(12)])
+
+# device config
+options_cfg = dict(
+ randomSeed=42,
+ partialsType='half',
+ train_cfg=dict(
+ executionStrategy='SameAsIpu',
+ Training=dict(gradientAccumulation=32),
+ availableMemoryProportion=[0.3, 0.3, 0.3, 0.3],
+ ),
+ eval_cfg=dict(deviceIterations=1, ),
+)
+
+# add model partition config and device config to runner
+runner = dict(
+ type='IterBasedRunner',
+ ipu_model_cfg=ipu_model_cfg,
+ options_cfg=options_cfg,
+ max_iters=5000)
+
+default_hooks = dict(checkpoint=dict(type='CheckpointHook', interval=1000))
+
+fp16 = dict(loss_scale=256.0, velocity_accum_type='half', accum_type='half')
diff --git a/configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py b/configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..e0f745874bcef7e3896cfc694c16bf4e5a235fae
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py
@@ -0,0 +1,38 @@
+_base_ = [
+ '../_base_/models/vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(img_size=384))
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=384, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-base-p16_64xb64_in1k.py b/configs/vision_transformer/vit-base-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..07be0e9a373a324f07989476314d391f2fee4f8e
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p16_64xb64_in1k.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/models/vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(
+ head=dict(hidden_dim=3072),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-base-p16_8xb64-lora_in1k-384px.py b/configs/vision_transformer/vit-base-p16_8xb64-lora_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..ffe1018e5d9c0f724911b782a555cb34d50d6ceb
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p16_8xb64-lora_in1k-384px.py
@@ -0,0 +1,84 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(
+ type='ImageClassifier',
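+ # LoRA fine-tuning: the backbone is wrapped with `LoRAModel`, so the
+ # pretrained ViT weights are kept frozen and only the rank-16 low-rank
+ # adapters injected into the qkv projections (plus the newly initialized
+ # classification head) are updated during training.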
+ backbone=dict(
+ type='LoRAModel',
+ module=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ drop_rate=0.1,
+ init_cfg=dict(type='Pretrained', checkpoint='',
+ prefix='backbone')),
+ alpha=16,
+ rank=16,
+ drop_rate=0.1,
+ targets=[dict(type='qkv')]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1,
+ mode='classy_vision'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)],
+ ))
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=384, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=45,
+ by_epoch=True,
+ begin=5,
+ end=50,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=50)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py b/configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..e5a4d14f4dad0759f70b9b9e29c085ad7eff292c
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py
@@ -0,0 +1,38 @@
+_base_ = [
+ '../_base_/models/vit-base-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(img_size=384))
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=384, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-base-p32_64xb64_in1k.py b/configs/vision_transformer/vit-base-p32_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9cfc7c47df0887e4ace1bbaeb59bb5d42e004a83
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p32_64xb64_in1k.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/models/vit-base-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(
+ head=dict(hidden_dim=3072),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-large-p16_64xb64_in1k-384px.py b/configs/vision_transformer/vit-large-p16_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..98e96ec68ffdaca2648e1ac2ae5a79db30ec8382
--- /dev/null
+++ b/configs/vision_transformer/vit-large-p16_64xb64_in1k-384px.py
@@ -0,0 +1,38 @@
+_base_ = [
+ '../_base_/models/vit-large-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(img_size=384))
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=384, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-large-p16_64xb64_in1k.py b/configs/vision_transformer/vit-large-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0d9bd283b779af36df99574bbdde7701c6b41393
--- /dev/null
+++ b/configs/vision_transformer/vit-large-p16_64xb64_in1k.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/models/vit-large-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(
+ head=dict(hidden_dim=3072),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-large-p32_64xb64_in1k-384px.py b/configs/vision_transformer/vit-large-p32_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..22320d119890bb80aca47e45322dabeee4d0feb7
--- /dev/null
+++ b/configs/vision_transformer/vit-large-p32_64xb64_in1k-384px.py
@@ -0,0 +1,38 @@
+_base_ = [
+ '../_base_/models/vit-large-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(img_size=384))
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=384, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-large-p32_64xb64_in1k.py b/configs/vision_transformer/vit-large-p32_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..61e179165b84d8aa521426aa992cc2460d7ae0a5
--- /dev/null
+++ b/configs/vision_transformer/vit-large-p32_64xb64_in1k.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/models/vit-large-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(
+ head=dict(hidden_dim=3072),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/wrn/README.md b/configs/wrn/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2753307b06699b4235aaf1465f0ce5cf89a30952
--- /dev/null
+++ b/configs/wrn/README.md
@@ -0,0 +1,76 @@
+# Wide-ResNet
+
+> [Wide Residual Networks](https://arxiv.org/abs/1605.07146)
+
+
+
+## Abstract
+
+Deep residual networks were shown to be able to scale up to thousands of layers and still have improving performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease depth and increase width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior over their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet.
+
+
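+
+The core idea is simply to keep the residual topology but multiply the number of channels by a widening factor. The block below is a simplified CIFAR-style sketch for intuition only; the configs in this folder instead use the ImageNet Wide-ResNet-50/101 variants, which widen the inner channels of the standard ResNet bottleneck.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class WideBasicBlock(nn.Module):
+    """A basic residual block whose channels are widened by `widen_factor`."""
+
+    def __init__(self, in_channels: int, channels: int, widen_factor: int = 2):
+        super().__init__()
+        width = channels * widen_factor
+        self.conv1 = nn.Conv2d(in_channels, width, 3, padding=1, bias=False)
+        self.bn1 = nn.BatchNorm2d(width)
+        self.conv2 = nn.Conv2d(width, width, 3, padding=1, bias=False)
+        self.bn2 = nn.BatchNorm2d(width)
+        self.relu = nn.ReLU(inplace=True)
+        self.shortcut = (nn.Identity() if in_channels == width else
+                         nn.Conv2d(in_channels, width, 1, bias=False))
+
+    def forward(self, x):
+        out = self.relu(self.bn1(self.conv1(x)))
+        out = self.bn2(self.conv2(out))
+        return self.relu(out + self.shortcut(x))
+
+
+block = WideBasicBlock(16, 16, widen_factor=4)  # 4x wider than the plain block
+print(block(torch.rand(1, 16, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
+```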
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('wide-resnet50_3rdparty_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('wide-resnet50_3rdparty_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/wrn/wide-resnet50_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty_8xb32_in1k_20220304-66678344.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :-----------------------------------------------------------------: |
+| `wide-resnet50_3rdparty_8xb32_in1k`\* | From scratch | 68.88 | 11.44 | 78.48 | 94.08 | [config](wide-resnet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty_8xb32_in1k_20220304-66678344.pth) |
+| `wide-resnet101_3rdparty_8xb32_in1k`\* | From scratch | 126.89 | 22.81 | 78.84 | 94.28 | [config](wide-resnet101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet101_3rdparty_8xb32_in1k_20220304-8d5f9d61.pth) |
+| `wide-resnet50_3rdparty-timm_8xb32_in1k`\* | From scratch | 68.88 | 11.44 | 81.45 | 95.53 | [config](wide-resnet50_timm_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty-timm_8xb32_in1k_20220304-83ae4399.pth) |
+
+*Models with * are converted from [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/resnet.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@INPROCEEDINGS{Zagoruyko2016WRN,
+ author = {Sergey Zagoruyko and Nikos Komodakis},
+ title = {Wide Residual Networks},
+ booktitle = {BMVC},
+ year = {2016}}
+```
diff --git a/configs/wrn/metafile.yml b/configs/wrn/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..75e346720cf626c923514e01a5bd3ed33849da9a
--- /dev/null
+++ b/configs/wrn/metafile.yml
@@ -0,0 +1,77 @@
+Collections:
+ - Name: Wide-ResNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 100
+ Batch Size: 256
+ Architecture:
+ - 1x1 Convolution
+ - Batch Normalization
+ - Convolution
+ - Global Average Pooling
+ - Max Pooling
+ - ReLU
+ - Residual Connection
+ - Softmax
+ - Wide Residual Block
+ Paper:
+ URL: https://arxiv.org/abs/1605.07146
+ Title: "Wide Residual Networks"
+ README: configs/wrn/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/resnet.py#L383
+ Version: v0.20.1
+
+Models:
+ - Name: wide-resnet50_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 11440000000 # 11.44G
+ Parameters: 68880000 # 68.88M
+ In Collection: Wide-ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.48
+ Top 5 Accuracy: 94.08
+ Weights: https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty_8xb32_in1k_20220304-66678344.pth
+ Config: configs/wrn/wide-resnet50_8xb32_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/wide_resnet50_2-95faca4d.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py
+ - Name: wide-resnet101_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 22810000000 # 22.81G
+ Parameters: 126890000 # 126.89M
+ In Collection: Wide-ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.84
+ Top 5 Accuracy: 94.28
+ Weights: https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet101_3rdparty_8xb32_in1k_20220304-8d5f9d61.pth
+ Config: configs/wrn/wide-resnet101_8xb32_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/wide_resnet101_2-32ee1156.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py
+ - Name: wide-resnet50_3rdparty-timm_8xb32_in1k
+ Metadata:
+ FLOPs: 11440000000 # 11.44G
+ Parameters: 68880000 # 68.88M
+ In Collection: Wide-ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.45
+ Top 5 Accuracy: 95.53
+ Weights: https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty-timm_8xb32_in1k_20220304-83ae4399.pth
+ Config: configs/wrn/wide-resnet50_timm_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/wide_resnet50_racm-8234f177.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/resnet.py
diff --git a/configs/wrn/wide-resnet101_8xb32_in1k.py b/configs/wrn/wide-resnet101_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d1bf5e5e5fac3655bd27f64f4c5c5a1316403a3b
--- /dev/null
+++ b/configs/wrn/wide-resnet101_8xb32_in1k.py
@@ -0,0 +1,7 @@
+_base_ = [
+ '../_base_/models/wide-resnet50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
+
+model = dict(backbone=dict(depth=101))
diff --git a/configs/wrn/wide-resnet50_8xb32_in1k.py b/configs/wrn/wide-resnet50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..edf6a0518ac73f4eaa54f261ecbfce8acf0f2035
--- /dev/null
+++ b/configs/wrn/wide-resnet50_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/wide-resnet50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/wrn/wide-resnet50_timm_8xb32_in1k.py b/configs/wrn/wide-resnet50_timm_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8dca8f37319f8d60df0e42123b2ebe16a3f7d9d8
--- /dev/null
+++ b/configs/wrn/wide-resnet50_timm_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/wide-resnet50.py',
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/xcit/README.md b/configs/xcit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..ab2cd7a3634e4d877bca3d5125d3506d3861b428
--- /dev/null
+++ b/configs/xcit/README.md
@@ -0,0 +1,106 @@
+# XCiT
+
+> [XCiT: Cross-Covariance Image Transformers](https://arxiv.org/abs/2106.09681)
+
+
+
+## Abstract
+
+Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
+
+
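+
+The cross-covariance attention (XCA) described above attends over feature channels instead of tokens: queries and keys are L2-normalized along the token dimension, their C x C cross-covariance matrix (scaled by a temperature) serves as the attention map, so the cost grows linearly with the number of tokens. A minimal single-head sketch is shown below; it is illustrative only and omits the multi-head split, projections and learnable temperature of the real model.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def cross_covariance_attention(q, k, v, temperature=1.0):
+    """Single-head XCA sketch: attention over channels instead of tokens.
+
+    q, k, v: (B, N, C) token features; returns (B, N, C).
+    """
+    # normalize along the token dimension so the attention map becomes a
+    # (C x C) cross-covariance matrix between feature channels
+    q = F.normalize(q, dim=1)
+    k = F.normalize(k, dim=1)
+    attn = (q.transpose(-2, -1) @ k) * temperature  # (B, C, C)
+    attn = attn.softmax(dim=-1)
+    return (attn @ v.transpose(-2, -1)).transpose(-2, -1)  # (B, N, C)
+
+
+x = torch.randn(2, 196, 128)  # 196 tokens with 128 channels
+out = cross_covariance_attention(x, x, x)
+print(out.shape)  # torch.Size([2, 196, 128])
+```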
+
+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('xcit-nano-12-p16_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/xcit/xcit-nano-12-p16_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty_in1k_20230213-ed776c38.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :---------------------------------------------- | :--------: | :-------: | :-----------------------------------------------: | :-----------------------------------------------------------------------------------: |
+| `xcit-nano-12-p16_3rdparty_in1k`\* | 3.05 | 0.56 | [config](xcit-nano-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty_in1k_20230213-ed776c38.pth) |
+| `xcit-nano-12-p16_3rdparty-dist_in1k`\* | 3.05 | 0.56 | [config](xcit-nano-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty-dist_in1k_20230213-fb247f7b.pth) |
+| `xcit-tiny-12-p16_3rdparty_in1k`\* | 6.72 | 1.24 | [config](xcit-tiny-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty_in1k_20230213-82c547ca.pth) |
+| `xcit-tiny-12-p16_3rdparty-dist_in1k`\* | 6.72 | 1.24 | [config](xcit-tiny-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty-dist_in1k_20230213-d5fde0a3.pth) |
+| `xcit-nano-12-p16_3rdparty-dist_in1k-384px`\* | 3.05 | 1.64 | [config](xcit-nano-12-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty-dist_in1k-384px_20230213-712db4d4.pth) |
+| `xcit-nano-12-p8_3rdparty_in1k`\* | 3.05 | 2.16 | [config](xcit-nano-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty_in1k_20230213-3370c293.pth) |
+| `xcit-nano-12-p8_3rdparty-dist_in1k`\* | 3.05 | 2.16 | [config](xcit-nano-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty-dist_in1k_20230213-2f87d2b3.pth) |
+| `xcit-tiny-24-p16_3rdparty_in1k`\* | 12.12 | 2.34 | [config](xcit-tiny-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty_in1k_20230213-366c1cd0.pth) |
+| `xcit-tiny-24-p16_3rdparty-dist_in1k`\* | 12.12 | 2.34 | [config](xcit-tiny-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty-dist_in1k_20230213-b472e80a.pth) |
+| `xcit-tiny-12-p16_3rdparty-dist_in1k-384px`\* | 6.72 | 3.64 | [config](xcit-tiny-12-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty-dist_in1k-384px_20230213-00a20023.pth) |
+| `xcit-tiny-12-p8_3rdparty_in1k`\* | 6.71 | 4.81 | [config](xcit-tiny-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty_in1k_20230213-8b02f8f5.pth) |
+| `xcit-tiny-12-p8_3rdparty-dist_in1k`\* | 6.71 | 4.81 | [config](xcit-tiny-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty-dist_in1k_20230213-f3f9b44f.pth) |
+| `xcit-small-12-p16_3rdparty_in1k`\* | 26.25 | 4.81 | [config](xcit-small-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty_in1k_20230213-d36779d2.pth) |
+| `xcit-small-12-p16_3rdparty-dist_in1k`\* | 26.25 | 4.81 | [config](xcit-small-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty-dist_in1k_20230213-c95bbae1.pth) |
+| `xcit-nano-12-p8_3rdparty-dist_in1k-384px`\* | 3.05 | 6.34 | [config](xcit-nano-12-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty-dist_in1k-384px_20230213-09d925ef.pth) |
+| `xcit-tiny-24-p16_3rdparty-dist_in1k-384px`\* | 12.12 | 6.87 | [config](xcit-tiny-24-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty-dist_in1k-384px_20230213-20e13917.pth) |
+| `xcit-small-24-p16_3rdparty_in1k`\* | 47.67 | 9.10 | [config](xcit-small-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty_in1k_20230213-40febe38.pth) |
+| `xcit-small-24-p16_3rdparty-dist_in1k`\* | 47.67 | 9.10 | [config](xcit-small-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty-dist_in1k_20230213-130d7262.pth) |
+| `xcit-tiny-24-p8_3rdparty_in1k`\* | 12.11 | 9.21 | [config](xcit-tiny-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty_in1k_20230213-4b9ba392.pth) |
+| `xcit-tiny-24-p8_3rdparty-dist_in1k`\* | 12.11 | 9.21 | [config](xcit-tiny-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty-dist_in1k_20230213-ad9c44b0.pth) |
+| `xcit-tiny-12-p8_3rdparty-dist_in1k-384px`\* | 6.71 | 14.13 | [config](xcit-tiny-12-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty-dist_in1k-384px_20230213-a072174a.pth) |
+| `xcit-small-12-p16_3rdparty-dist_in1k-384px`\* | 26.25 | 14.14 | [config](xcit-small-12-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty-dist_in1k-384px_20230213-ba36c982.pth) |
+| `xcit-medium-24-p16_3rdparty_in1k`\* | 84.40 | 16.13 | [config](xcit-medium-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty_in1k_20230213-ad0aa92e.pth) |
+| `xcit-medium-24-p16_3rdparty-dist_in1k`\* | 84.40 | 16.13 | [config](xcit-medium-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty-dist_in1k_20230213-aca5cd0c.pth) |
+| `xcit-small-12-p8_3rdparty_in1k`\* | 26.21 | 18.69 | [config](xcit-small-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty_in1k_20230213-9e364ce3.pth) |
+| `xcit-small-12-p8_3rdparty-dist_in1k`\* | 26.21 | 18.69 | [config](xcit-small-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty-dist_in1k_20230213-71886580.pth) |
+| `xcit-small-24-p16_3rdparty-dist_in1k-384px`\* | 47.67 | 26.72 | [config](xcit-small-24-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty-dist_in1k-384px_20230213-28fa2d0e.pth) |
+| `xcit-tiny-24-p8_3rdparty-dist_in1k-384px`\* | 12.11 | 27.05 | [config](xcit-tiny-24-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty-dist_in1k-384px_20230213-30d5e5ec.pth) |
+| `xcit-small-24-p8_3rdparty_in1k`\* | 47.63 | 35.81 | [config](xcit-small-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty_in1k_20230213-280ebcc7.pth) |
+| `xcit-small-24-p8_3rdparty-dist_in1k`\* | 47.63 | 35.81 | [config](xcit-small-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty-dist_in1k_20230213-f2773c78.pth) |
+| `xcit-large-24-p16_3rdparty_in1k`\* | 189.10 | 35.86 | [config](xcit-large-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty_in1k_20230214-d29d2529.pth) |
+| `xcit-large-24-p16_3rdparty-dist_in1k`\* | 189.10 | 35.86 | [config](xcit-large-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty-dist_in1k_20230214-4fea599c.pth) |
+| `xcit-medium-24-p16_3rdparty-dist_in1k-384px`\* | 84.40 | 47.39 | [config](xcit-medium-24-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty-dist_in1k-384px_20230214-6c23a201.pth) |
+| `xcit-small-12-p8_3rdparty-dist_in1k-384px`\* | 26.21 | 54.92 | [config](xcit-small-12-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty-dist_in1k-384px_20230214-9f2178bc.pth) |
+| `xcit-medium-24-p8_3rdparty_in1k`\* | 84.32 | 63.52 | [config](xcit-medium-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty_in1k_20230214-c362850b.pth) |
+| `xcit-medium-24-p8_3rdparty-dist_in1k`\* | 84.32 | 63.52 | [config](xcit-medium-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty-dist_in1k_20230214-625c953b.pth) |
+| `xcit-small-24-p8_3rdparty-dist_in1k-384px`\* | 47.63 | 105.24 | [config](xcit-small-24-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty-dist_in1k-384px_20230214-57298eca.pth) |
+| `xcit-large-24-p16_3rdparty-dist_in1k-384px`\* | 189.10 | 105.35 | [config](xcit-large-24-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty-dist_in1k-384px_20230214-bd515a34.pth) |
+| `xcit-large-24-p8_3rdparty_in1k`\* | 188.93 | 141.23 | [config](xcit-large-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty_in1k_20230214-08f2f664.pth) |
+| `xcit-large-24-p8_3rdparty-dist_in1k`\* | 188.93 | 141.23 | [config](xcit-large-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty-dist_in1k_20230214-8c092b34.pth) |
+| `xcit-medium-24-p8_3rdparty-dist_in1k-384px`\* | 84.32 | 186.67 | [config](xcit-medium-24-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty-dist_in1k-384px_20230214-5db925e0.pth) |
+| `xcit-large-24-p8_3rdparty-dist_in1k-384px`\* | 188.93 | 415.00 | [config](xcit-large-24-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty-dist_in1k-384px_20230214-9f718b1a.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/xcit). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{el2021xcit,
+ title={XCiT: Cross-Covariance Image Transformers},
+ author={El-Nouby, Alaaeldin and Touvron, Hugo and Caron, Mathilde and Bojanowski, Piotr and Douze, Matthijs and Joulin, Armand and Laptev, Ivan and Neverova, Natalia and Synnaeve, Gabriel and Verbeek, Jakob and others},
+ journal={arXiv preprint arXiv:2106.09681},
+ year={2021}
+}
+```
diff --git a/configs/xcit/metafile.yml b/configs/xcit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..8379da1927cae6a45433351ca0b930b54f0e9ba7
--- /dev/null
+++ b/configs/xcit/metafile.yml
@@ -0,0 +1,727 @@
+Collections:
+ - Name: XCiT
+ Metadata:
+ Architecture:
+ - Class Attention
+ - Local Patch Interaction
+ - Cross-Covariance Attention
+ Paper:
+ Title: 'XCiT: Cross-Covariance Image Transformers'
+ URL: https://arxiv.org/abs/2106.09681
+ README: configs/xcit/README.md
+
+Models:
+ - Name: xcit-nano-12-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 557074560
+ Parameters: 3053224
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 70.35
+ Top 5 Accuracy: 89.98
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty_in1k_20230213-ed776c38.pth
+ Config: configs/xcit/xcit-nano-12-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p16_224.pth
+ - Name: xcit-nano-12-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 557074560
+ Parameters: 3053224
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 72.36
+ Top 5 Accuracy: 91.02
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty-dist_in1k_20230213-fb247f7b.pth
+ Config: configs/xcit/xcit-nano-12-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p16_224_dist.pth
+ - Name: xcit-tiny-12-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 1239698112
+ Parameters: 6716272
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.21
+ Top 5 Accuracy: 93.62
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty_in1k_20230213-82c547ca.pth
+ Config: configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p16_224.pth
+ - Name: xcit-tiny-12-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 1239698112
+ Parameters: 6716272
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.7
+ Top 5 Accuracy: 94.12
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty-dist_in1k_20230213-d5fde0a3.pth
+ Config: configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p16_224_dist.pth
+ - Name: xcit-nano-12-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 1636347520
+ Parameters: 3053224
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.93
+ Top 5 Accuracy: 92.42
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty-dist_in1k-384px_20230213-712db4d4.pth
+ Config: configs/xcit/xcit-nano-12-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p16_384_dist.pth
+ - Name: xcit-nano-12-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 2156861056
+ Parameters: 3049016
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 73.8
+ Top 5 Accuracy: 92.08
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty_in1k_20230213-3370c293.pth
+ Config: configs/xcit/xcit-nano-12-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p8_224.pth
+ - Name: xcit-nano-12-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 2156861056
+ Parameters: 3049016
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.17
+ Top 5 Accuracy: 93.08
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty-dist_in1k_20230213-2f87d2b3.pth
+ Config: configs/xcit/xcit-nano-12-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p8_224_dist.pth
+ - Name: xcit-tiny-24-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 2339305152
+ Parameters: 12116896
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.47
+ Top 5 Accuracy: 94.85
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty_in1k_20230213-366c1cd0.pth
+ Config: configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p16_224.pth
+ - Name: xcit-tiny-24-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 2339305152
+ Parameters: 12116896
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.51
+ Top 5 Accuracy: 95.17
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty-dist_in1k_20230213-b472e80a.pth
+ Config: configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p16_224_dist.pth
+ - Name: xcit-tiny-12-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 3641468352
+ Parameters: 6716272
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.58
+ Top 5 Accuracy: 95.38
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty-dist_in1k-384px_20230213-00a20023.pth
+ Config: configs/xcit/xcit-tiny-12-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p16_384_dist.pth
+ - Name: xcit-tiny-12-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 4807399872
+ Parameters: 6706504
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.75
+ Top 5 Accuracy: 94.88
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty_in1k_20230213-8b02f8f5.pth
+ Config: configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p8_224.pth
+ - Name: xcit-tiny-12-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 4807399872
+ Parameters: 6706504
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.26
+ Top 5 Accuracy: 95.46
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty-dist_in1k_20230213-f3f9b44f.pth
+ Config: configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p8_224_dist.pth
+ - Name: xcit-small-12-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 4814951808
+ Parameters: 26253304
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.87
+ Top 5 Accuracy: 95.77
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty_in1k_20230213-d36779d2.pth
+ Config: configs/xcit/xcit-small-12-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p16_224.pth
+ - Name: xcit-small-12-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 4814951808
+ Parameters: 26253304
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.12
+ Top 5 Accuracy: 96.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty-dist_in1k_20230213-c95bbae1.pth
+ Config: configs/xcit/xcit-small-12-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p16_224_dist.pth
+ - Name: xcit-nano-12-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 6337760896
+ Parameters: 3049016
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.69
+ Top 5 Accuracy: 94.09
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty-dist_in1k-384px_20230213-09d925ef.pth
+ Config: configs/xcit/xcit-nano-12-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p8_384_dist.pth
+ - Name: xcit-tiny-24-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 6872966592
+ Parameters: 12116896
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.43
+ Top 5 Accuracy: 96.2
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty-dist_in1k-384px_20230213-20e13917.pth
+ Config: configs/xcit/xcit-tiny-24-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p16_384_dist.pth
+ - Name: xcit-small-24-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 9095064960
+ Parameters: 47671384
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.38
+ Top 5 Accuracy: 95.93
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty_in1k_20230213-40febe38.pth
+ Config: configs/xcit/xcit-small-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p16_224.pth
+ - Name: xcit-small-24-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 9095064960
+ Parameters: 47671384
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.7
+ Top 5 Accuracy: 96.61
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty-dist_in1k_20230213-130d7262.pth
+ Config: configs/xcit/xcit-small-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p16_224_dist.pth
+ - Name: xcit-tiny-24-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 9205828032
+ Parameters: 12107128
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.7
+ Top 5 Accuracy: 95.9
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty_in1k_20230213-4b9ba392.pth
+ Config: configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p8_224.pth
+ - Name: xcit-tiny-24-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 9205828032
+ Parameters: 12107128
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.62
+ Top 5 Accuracy: 96.16
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty-dist_in1k_20230213-ad9c44b0.pth
+ Config: configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p8_224_dist.pth
+ - Name: xcit-tiny-12-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 14126142912
+ Parameters: 6706504
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.46
+ Top 5 Accuracy: 96.22
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty-dist_in1k-384px_20230213-a072174a.pth
+ Config: configs/xcit/xcit-tiny-12-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p8_384_dist.pth
+ - Name: xcit-small-12-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 14143179648
+ Parameters: 26253304
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.74
+ Top 5 Accuracy: 97.19
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty-dist_in1k-384px_20230213-ba36c982.pth
+ Config: configs/xcit/xcit-small-12-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p16_384_dist.pth
+ - Name: xcit-medium-24-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 16129561088
+ Parameters: 84395752
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.56
+ Top 5 Accuracy: 95.82
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty_in1k_20230213-ad0aa92e.pth
+ Config: configs/xcit/xcit-medium-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p16_224.pth
+ - Name: xcit-medium-24-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 16129561088
+ Parameters: 84395752
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.15
+ Top 5 Accuracy: 96.82
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty-dist_in1k_20230213-aca5cd0c.pth
+ Config: configs/xcit/xcit-medium-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p16_224_dist.pth
+ - Name: xcit-small-12-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 18691601280
+ Parameters: 26213032
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.21
+ Top 5 Accuracy: 96.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty_in1k_20230213-9e364ce3.pth
+ Config: configs/xcit/xcit-small-12-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p8_224.pth
+ - Name: xcit-small-12-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 18691601280
+ Parameters: 26213032
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.97
+ Top 5 Accuracy: 96.81
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty-dist_in1k_20230213-71886580.pth
+ Config: configs/xcit/xcit-small-12-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p8_224_dist.pth
+ - Name: xcit-small-24-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 26721471360
+ Parameters: 47671384
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.1
+ Top 5 Accuracy: 97.32
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty-dist_in1k-384px_20230213-28fa2d0e.pth
+ Config: configs/xcit/xcit-small-24-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p16_384_dist.pth
+ - Name: xcit-tiny-24-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 27052135872
+ Parameters: 12107128
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.77
+ Top 5 Accuracy: 96.72
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty-dist_in1k-384px_20230213-30d5e5ec.pth
+ Config: configs/xcit/xcit-tiny-24-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p8_384_dist.pth
+ - Name: xcit-small-24-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 35812053888
+ Parameters: 47631112
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.62
+ Top 5 Accuracy: 96.51
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty_in1k_20230213-280ebcc7.pth
+ Config: configs/xcit/xcit-small-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p8_224.pth
+ - Name: xcit-small-24-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 35812053888
+ Parameters: 47631112
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.68
+ Top 5 Accuracy: 97.07
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty-dist_in1k_20230213-f2773c78.pth
+ Config: configs/xcit/xcit-small-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p8_224_dist.pth
+ - Name: xcit-large-24-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 35855948544
+ Parameters: 189096136
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.97
+ Top 5 Accuracy: 95.86
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty_in1k_20230214-d29d2529.pth
+ Config: configs/xcit/xcit-large-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p16_224.pth
+ - Name: xcit-large-24-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 35855948544
+ Parameters: 189096136
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.61
+ Top 5 Accuracy: 97.07
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty-dist_in1k_20230214-4fea599c.pth
+ Config: configs/xcit/xcit-large-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p16_224_dist.pth
+ - Name: xcit-medium-24-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 47388932608
+ Parameters: 84395752
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.47
+ Top 5 Accuracy: 97.49
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty-dist_in1k-384px_20230214-6c23a201.pth
+ Config: configs/xcit/xcit-medium-24-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p16_384_dist.pth
+ - Name: xcit-small-12-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 54923537280
+ Parameters: 26213032
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.12
+ Top 5 Accuracy: 97.31
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty-dist_in1k-384px_20230214-9f2178bc.pth
+ Config: configs/xcit/xcit-small-12-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p8_384_dist.pth
+ - Name: xcit-medium-24-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 63524706816
+ Parameters: 84323624
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.61
+ Top 5 Accuracy: 96.23
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty_in1k_20230214-c362850b.pth
+ Config: configs/xcit/xcit-medium-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p8_224.pth
+ - Name: xcit-medium-24-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 63524706816
+ Parameters: 84323624
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.0
+ Top 5 Accuracy: 97.16
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty-dist_in1k_20230214-625c953b.pth
+ Config: configs/xcit/xcit-medium-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p8_224_dist.pth
+ - Name: xcit-small-24-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 105236704128
+ Parameters: 47631112
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.57
+ Top 5 Accuracy: 97.6
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty-dist_in1k-384px_20230214-57298eca.pth
+ Config: configs/xcit/xcit-small-24-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p8_384_dist.pth
+ - Name: xcit-large-24-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 105345095424
+ Parameters: 189096136
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.78
+ Top 5 Accuracy: 97.6
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty-dist_in1k-384px_20230214-bd515a34.pth
+ Config: configs/xcit/xcit-large-24-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p16_384_dist.pth
+ - Name: xcit-large-24-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 141225699072
+ Parameters: 188932648
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.23
+ Top 5 Accuracy: 96.58
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty_in1k_20230214-08f2f664.pth
+ Config: configs/xcit/xcit-large-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p8_224.pth
+ - Name: xcit-large-24-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 141225699072
+ Parameters: 188932648
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.14
+ Top 5 Accuracy: 97.32
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty-dist_in1k_20230214-8c092b34.pth
+ Config: configs/xcit/xcit-large-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p8_224_dist.pth
+ - Name: xcit-medium-24-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 186672626176
+ Parameters: 84323624
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.87
+ Top 5 Accuracy: 97.61
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty-dist_in1k-384px_20230214-5db925e0.pth
+ Config: configs/xcit/xcit-medium-24-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p8_384_dist.pth
+ - Name: xcit-large-24-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 415003137792
+ Parameters: 188932648
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.13
+ Top 5 Accuracy: 97.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty-dist_in1k-384px_20230214-9f718b1a.pth
+ Config: configs/xcit/xcit-large-24-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p8_384_dist.pth
diff --git a/configs/xcit/xcit-large-24-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-large-24-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..b393c4aea03ab1927e11773609562cd323963931
--- /dev/null
+++ b/configs/xcit/xcit-large-24-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=768,
+ depth=24,
+ num_heads=16,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-large-24-p16_8xb128_in1k.py b/configs/xcit/xcit-large-24-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b5c01cb5f72e93ad8b5e81d363b3c3f914504f64
--- /dev/null
+++ b/configs/xcit/xcit-large-24-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=768,
+ depth=24,
+ num_heads=16,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-large-24-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-large-24-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..46b8422b481e69100266798a2183cae56d6e345e
--- /dev/null
+++ b/configs/xcit/xcit-large-24-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=768,
+ depth=24,
+ num_heads=16,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-large-24-p8_8xb128_in1k.py b/configs/xcit/xcit-large-24-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..6dc67baa59b9e270b2c06bb0a928879ef8f78f60
--- /dev/null
+++ b/configs/xcit/xcit-large-24-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=768,
+ depth=24,
+ num_heads=16,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-medium-24-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-medium-24-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..8c91b9cd6e9511a8dbbae437a5454d35eb4c03e0
--- /dev/null
+++ b/configs/xcit/xcit-medium-24-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=512,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-medium-24-p16_8xb128_in1k.py b/configs/xcit/xcit-medium-24-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..148ed0640da548877cbf04c67bfc0bbb3351dfce
--- /dev/null
+++ b/configs/xcit/xcit-medium-24-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=512,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-medium-24-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-medium-24-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..3138ec4f0b41456d99e2d59d60575327e794f10e
--- /dev/null
+++ b/configs/xcit/xcit-medium-24-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=512,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-medium-24-p8_8xb128_in1k.py b/configs/xcit/xcit-medium-24-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b8277a10b772aa3c7a39ace2051829c8818df987
--- /dev/null
+++ b/configs/xcit/xcit-medium-24-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=512,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-nano-12-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-nano-12-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..bf8c27b3b1acee69892fa83a8be40da82b62fd44
--- /dev/null
+++ b/configs/xcit/xcit-nano-12-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=128,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=False,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=128,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-nano-12-p16_8xb128_in1k.py b/configs/xcit/xcit-nano-12-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3e9bf81c5f4639ee5c7ba57c9ef996c79076df65
--- /dev/null
+++ b/configs/xcit/xcit-nano-12-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=128,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=False,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=128,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-nano-12-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-nano-12-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..7dae69f0b3b9a2ea8792f0beed8e0ee68f0cc4e9
--- /dev/null
+++ b/configs/xcit/xcit-nano-12-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=128,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=False,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=128,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-nano-12-p8_8xb128_in1k.py b/configs/xcit/xcit-nano-12-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e6a003a30ef7348f29732ca1c36210704e886c1c
--- /dev/null
+++ b/configs/xcit/xcit-nano-12-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=128,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=False,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=128,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-12-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-small-12-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..54c80d498e0c1370f1122ee34ef1970a521796a7
--- /dev/null
+++ b/configs/xcit/xcit-small-12-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=384,
+ depth=12,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-12-p16_8xb128_in1k.py b/configs/xcit/xcit-small-12-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c546179f42f7a0a668d3d7f8d27ae137006577ae
--- /dev/null
+++ b/configs/xcit/xcit-small-12-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=384,
+ depth=12,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-12-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-small-12-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1b6a52c370578f9fe9420521d1bc494563071e6
--- /dev/null
+++ b/configs/xcit/xcit-small-12-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=384,
+ depth=12,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-12-p8_8xb128_in1k.py b/configs/xcit/xcit-small-12-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cbfbe151781fb012fae2099bb0a9b9bd5d7e563e
--- /dev/null
+++ b/configs/xcit/xcit-small-12-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=384,
+ depth=12,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-24-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-small-24-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..6eb41275b83939e2ac71f5e6e15fa2a8bf5f4df2
--- /dev/null
+++ b/configs/xcit/xcit-small-24-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=384,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-24-p16_8xb128_in1k.py b/configs/xcit/xcit-small-24-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5b3dc18f438ffb49bde71a795e24abf36c427e14
--- /dev/null
+++ b/configs/xcit/xcit-small-24-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=384,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-24-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-small-24-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..34445a09d637c222a25aa608de2f99bf1dacedb1
--- /dev/null
+++ b/configs/xcit/xcit-small-24-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=384,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-24-p8_8xb128_in1k.py b/configs/xcit/xcit-small-24-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..108e64d41ae0c34c17bc5e6a5baa6d46eb6a9d08
--- /dev/null
+++ b/configs/xcit/xcit-small-24-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=384,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-12-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-tiny-12-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..b64ebe497082ef6f9c4b93ad16e7343f66008e07
--- /dev/null
+++ b/configs/xcit/xcit-tiny-12-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=192,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py b/configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1b54592f88bad986e885129bbce9d585fb864206
--- /dev/null
+++ b/configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=192,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-12-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-tiny-12-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1acff7ead898fb45c8ab6eac5aa3ed3dd13d939
--- /dev/null
+++ b/configs/xcit/xcit-tiny-12-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=192,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py b/configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..39d97da21689382d0e6b168fd78f9a74b269e8c1
--- /dev/null
+++ b/configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=192,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-24-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-tiny-24-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..556043565e2e844f77a2a2b62e7ebe71d638590d
--- /dev/null
+++ b/configs/xcit/xcit-tiny-24-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=192,
+ depth=24,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py b/configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fdceb14323ac89a12d529f7112806fef7e6f9d66
--- /dev/null
+++ b/configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=192,
+ depth=24,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-24-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-tiny-24-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..2cee442e5b77481550d479c4f83cb2e9a80e46ae
--- /dev/null
+++ b/configs/xcit/xcit-tiny-24-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=192,
+ depth=24,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py b/configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..283f17e61708e9d19e5af09c57d8a937cec2e854
--- /dev/null
+++ b/configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=192,
+ depth=24,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/dataset-index.yml b/dataset-index.yml
new file mode 100644
index 0000000000000000000000000000000000000000..40ca62069295695d896134b60e66b2260066c072
--- /dev/null
+++ b/dataset-index.yml
@@ -0,0 +1,11 @@
+imagenet1k:
+ dataset: OpenDataLab/ImageNet-1K
+ download_root: data
+ data_root: data/imagenet
+ script: tools/dataset_converters/odl_imagenet1k_preprocess.sh
+
+cub:
+ dataset: OpenDataLab/CUB-200-2011
+ download_root: data
+ data_root: data/CUB_200_2011
+ script: tools/dataset_converters/odl_cub_preprocess.sh
diff --git a/demo/bird.JPEG b/demo/bird.JPEG
new file mode 100755
index 0000000000000000000000000000000000000000..9c132a099e87d1c3c1a76dfd9201b03801301eab
Binary files /dev/null and b/demo/bird.JPEG differ
diff --git a/demo/cat-dog.png b/demo/cat-dog.png
new file mode 100644
index 0000000000000000000000000000000000000000..2ddd0fdb2e6c9269a9739d525a8feae05af2ee5f
Binary files /dev/null and b/demo/cat-dog.png differ
diff --git a/demo/demo.JPEG b/demo/demo.JPEG
new file mode 100755
index 0000000000000000000000000000000000000000..fd3a93f59385d6ff632483646e6caee300b56d09
Binary files /dev/null and b/demo/demo.JPEG differ
diff --git a/demo/dog.jpg b/demo/dog.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..c68fb054ad2dd2e5968a866c3140849c84b5484b
Binary files /dev/null and b/demo/dog.jpg differ
diff --git a/demo/image_demo.py b/demo/image_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..015873506ce86a4af2d68e9df9b50e6afe5ec6bc
--- /dev/null
+++ b/demo/image_demo.py
@@ -0,0 +1,44 @@
+# Copyright (c) OpenMMLab. All rights reserved.
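+"""A simple demo for single-image classification.
+
+Example (the model name can be any checkpoint registered in the model zoo,
+e.g. an XCiT model from this repository):
+
+ python demo/image_demo.py demo/bird.JPEG xcit-small-12-p8_3rdparty-dist_in1k --show
+"""
+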
+from argparse import ArgumentParser
+
+from mmengine.fileio import dump
+from rich import print_json
+
+from mmpretrain.apis import ImageClassificationInferencer
+
+
+def main():
+ parser = ArgumentParser()
+ parser.add_argument('img', help='Image file')
+ parser.add_argument('model', help='Model name or config file path')
+ parser.add_argument('--checkpoint', help='Checkpoint file path.')
+ parser.add_argument(
+ '--show',
+ action='store_true',
+ help='Whether to show the prediction result in a window.')
+ parser.add_argument(
+ '--show-dir',
+ type=str,
+ help='The directory to save the visualization image.')
+ parser.add_argument('--device', help='Device used for inference')
+ args = parser.parse_args()
+
+ # build the model from a config file and a checkpoint file
+ try:
+ pretrained = args.checkpoint or True
+ inferencer = ImageClassificationInferencer(
+ args.model, pretrained=pretrained, device=args.device)
+ except ValueError:
+ raise ValueError(
+ f'Unavailable model "{args.model}", you can specify a model '
+ 'name or a config file, or find a model name from '
+ 'https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html#all-checkpoints' # noqa: E501
+ )
+ result = inferencer(args.img, show=args.show, show_dir=args.show_dir)[0]
+ # show the results
+ result.pop('pred_scores') # pred_scores is too verbose for a demo.
+ print_json(dump(result, file_format='json', indent=4))
+
+
+if __name__ == '__main__':
+ main()
diff --git a/demo/ipu_train_example.sh b/demo/ipu_train_example.sh
new file mode 100644
index 0000000000000000000000000000000000000000..94c8456d97897a717166d83fb4a494a8a61bfceb
--- /dev/null
+++ b/demo/ipu_train_example.sh
@@ -0,0 +1,9 @@
+
+
+# Get SOTA accuracy 81.2 for 224-input ViT fine-tuning; reference:
+# https://github.com/google-research/vision_transformer#available-vit-models
+# cfg: vit-base-p16_ft-4xb544_in1k-224_ipu trains the model in fp16 precision
+# 8 epochs, batch size 2176, 16 IPUs, 4 replicas, model throughput about 5600 images, training time roughly 0.6 hours
+cfg_name=vit-base-p16_ft-4xb544_in1k-224_ipu
+python3 tools/train.py configs/vision_transformer/${cfg_name}.py --ipu-replicas 4 --no-validate &&
+python3 tools/test.py configs/vision_transformer/${cfg_name}.py work_dirs/${cfg_name}/latest.pth --metrics accuracy --device ipu
diff --git a/docker/Dockerfile b/docker/Dockerfile
index f81a9f52839c74c109b7541d5d50e264f1ec8838..5f7df525ceb364d1d0dff72520bf9f75bf05f791 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -1,5 +1,26 @@
-FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.10.0-centos7.6-dtk-22.10.1-py37-latest
-ENV DEBIAN_FRONTEND=noninteractive
-# 安装pip相关依赖
-COPY requirements.txt requirements.txt
-RUN pip3 install -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com -r requirements.txt
+ARG PYTORCH="1.12.1"
+ARG CUDA="11.3"
+ARG CUDNN="8"
+
+FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel
+
+# fetch the keys, refer to https://forums.developer.nvidia.com/t/18-04-cuda-docker-image-is-broken/212892/9
+RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
+RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub
+
+ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX"
+ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
+ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"
+
+RUN apt-get update && apt-get install -y ffmpeg git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6 \
+ && apt-get clean \
+ && rm -rf /var/lib/apt/lists/*
+
+# Install MIM
+RUN pip install openmim
+
+# Install MMPretrain
+RUN conda clean --all
+RUN git clone https://github.com/open-mmlab/mmpretrain.git
+WORKDIR ./mmpretrain
+RUN mim install --no-cache-dir -e .
diff --git a/docker/requirements.txt b/docker/requirements.txt
deleted file mode 100644
index 88a2e62c0a93da425e87407f3cb98b93a0ba7216..0000000000000000000000000000000000000000
--- a/docker/requirements.txt
+++ /dev/null
@@ -1,15 +0,0 @@
-albumentations>=0.3.2 --no-binary qudida,albumentations
-colorama
-requests
-rich
-scipy
-matplotlib>=3.1.0
-numpy
-packaging
-codecov
-flake8
-interrogate
-isort==4.3.21
-pytest
-xdoctest >= 0.10.0
-yapf
diff --git a/docker/serve/Dockerfile b/docker/serve/Dockerfile
new file mode 100644
index 0000000000000000000000000000000000000000..c50c4e8ee829eace217e0991d10002ad4e4589da
--- /dev/null
+++ b/docker/serve/Dockerfile
@@ -0,0 +1,37 @@
+ARG PYTORCH="2.0.1"
+ARG CUDA="11.7"
+ARG CUDNN="8"
+FROM pytorch/torchserve:latest-gpu
+
+ARG MMPRE="1.2.0"
+
+ENV PYTHONUNBUFFERED TRUE
+
+ENV HOME="/home/model-server"
+ENV PATH="/opt/conda/bin:$HOME/.local/bin:$PATH"
+RUN export FORCE_CUDA=1
+
+# TORCHSERVE
+RUN pip install torchserve torch-model-archiver
+RUN pip install nvgpu
+
+# OPEN-MMLAB
+ARG PYTORCH
+ARG CUDA
+RUN pip install openmim
+RUN mim install mmpretrain==${MMPRE}
+RUN mkdir -p $HOME/tmp
+
+COPY --chown=model-server entrypoint.sh $HOME/.local/bin/entrypoint.sh
+
+RUN chmod +x $HOME/.local/bin/entrypoint.sh
+
+COPY --chown=model-server config.properties $HOME/config.properties
+
+EXPOSE 8080 8081 8082
+
+USER model-server
+WORKDIR $HOME
+ENV TEMP=$HOME/tmp
+ENTRYPOINT ["/home/model-server/.local/bin/entrypoint.sh"]
+CMD ["serve"]
diff --git a/docker/serve/config.properties b/docker/serve/config.properties
new file mode 100644
index 0000000000000000000000000000000000000000..efb9c47e40ab550bac765611e6c6c6f2a7152f11
--- /dev/null
+++ b/docker/serve/config.properties
@@ -0,0 +1,5 @@
+inference_address=http://0.0.0.0:8080
+management_address=http://0.0.0.0:8081
+metrics_address=http://0.0.0.0:8082
+model_store=/home/model-server/model-store
+load_models=all
diff --git a/docker/serve/entrypoint.sh b/docker/serve/entrypoint.sh
new file mode 100644
index 0000000000000000000000000000000000000000..41ba00b048aed84b45c5a8015a016ff148e97d86
--- /dev/null
+++ b/docker/serve/entrypoint.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+set -e
+
+if [[ "$1" = "serve" ]]; then
+ shift 1
+ torchserve --start --ts-config /home/model-server/config.properties
+else
+ eval "$@"
+fi
+
+# prevent docker exit
+tail -f /dev/null
diff --git a/docs/en/Makefile b/docs/en/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..d4bb2cbb9eddb1bb1b4f366623044af8e4830919
--- /dev/null
+++ b/docs/en/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS ?=
+SPHINXBUILD ?= sphinx-build
+SOURCEDIR = .
+BUILDDIR = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+ @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+ @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/en/_static/css/readthedocs.css b/docs/en/_static/css/readthedocs.css
new file mode 100644
index 0000000000000000000000000000000000000000..4c7fa98fa8d80fbff62c508002aa5f65520195e9
--- /dev/null
+++ b/docs/en/_static/css/readthedocs.css
@@ -0,0 +1,62 @@
+.header-logo {
+ background-image: url("../image/mmpt-logo.png");
+ background-size: 183px 50px;
+ height: 50px;
+ width: 183px;
+}
+
+@media screen and (min-width: 1100px) {
+ .header-logo {
+ top: -12px;
+ }
+}
+
+pre {
+ white-space: pre;
+}
+
+@media screen and (min-width: 2000px) {
+ .pytorch-content-left {
+ width: 1200px;
+ margin-left: 30px;
+ }
+ article.pytorch-article {
+ max-width: 1200px;
+ }
+ .pytorch-breadcrumbs-wrapper {
+ width: 1200px;
+ }
+ .pytorch-right-menu.scrolling-fixed {
+ position: fixed;
+ top: 45px;
+ left: 1580px;
+ }
+}
+
+
+article.pytorch-article section code {
+ padding: .2em .4em;
+ background-color: #f3f4f7;
+ border-radius: 5px;
+}
+
+/* Disable the change in tables */
+article.pytorch-article section table code {
+ padding: unset;
+ background-color: unset;
+ border-radius: unset;
+}
+
+table.autosummary td {
+ width: 50%
+}
+
+img.align-center {
+ display: block;
+ margin-left: auto;
+ margin-right: auto;
+}
+
+article.pytorch-article p.rubric {
+ font-weight: bold;
+}
diff --git a/docs/en/_static/image/confusion-matrix.png b/docs/en/_static/image/confusion-matrix.png
new file mode 100755
index 0000000000000000000000000000000000000000..a1dc7ba6a73700ff55f81e40d00bc16f4da26b31
Binary files /dev/null and b/docs/en/_static/image/confusion-matrix.png differ
diff --git a/docs/en/_static/image/mmpt-logo.png b/docs/en/_static/image/mmpt-logo.png
new file mode 100644
index 0000000000000000000000000000000000000000..f4e060716520ece5db7e85df3c3ad8fd9e0eda57
Binary files /dev/null and b/docs/en/_static/image/mmpt-logo.png differ
diff --git a/docs/en/_static/image/tools/analysis/analyze_log.jpg b/docs/en/_static/image/tools/analysis/analyze_log.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..8eb1a27d6464d255b84b23a7460a5f622f51712f
Binary files /dev/null and b/docs/en/_static/image/tools/analysis/analyze_log.jpg differ
diff --git a/docs/en/_static/js/custom.js b/docs/en/_static/js/custom.js
new file mode 100644
index 0000000000000000000000000000000000000000..3eec9f46f8d3d360c0dcc256ddddd65b456e9553
--- /dev/null
+++ b/docs/en/_static/js/custom.js
@@ -0,0 +1,10 @@
+var collapsedSections = ['Advanced Guides', 'Model Zoo', 'Visualization', 'Analysis Tools', 'Deployment', 'Notes'];
+
+$(document).ready(function () {
+ $('.model-summary').DataTable({
+ "stateSave": false,
+ "lengthChange": false,
+ "pageLength": 20,
+ "order": []
+ });
+});
diff --git a/docs/en/_templates/404.html b/docs/en/_templates/404.html
new file mode 100644
index 0000000000000000000000000000000000000000..639d255989a87263c1d8a07df2312e1882104e90
--- /dev/null
+++ b/docs/en/_templates/404.html
@@ -0,0 +1,18 @@
+{% extends "layout.html" %}
+
+{% block body %}
+
+<h1>Page Not Found</h1>
+
+<p>
+  The page you are looking for cannot be found.
+</p>
+
+<p>
+  If you just switched documentation versions, it is likely that the page you were on has moved. You can look for it in
+  the content table on the left, or go to the homepage.
+</p>
+
+<p>
+  If you cannot find the documentation you want, please open an issue to tell us!
+</p>
+
+{% endblock %}
diff --git a/docs/en/_templates/autosummary/class.rst b/docs/en/_templates/autosummary/class.rst
new file mode 100644
index 0000000000000000000000000000000000000000..4c3a7a9abf5c5b14ac3ef3b00a2f070480295358
--- /dev/null
+++ b/docs/en/_templates/autosummary/class.rst
@@ -0,0 +1,13 @@
+.. role:: hidden
+ :class: hidden-section
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+ :members:
+
+..
+ autogenerated from _templates/autosummary/class.rst
+ note it does not have :inherited-members:
diff --git a/docs/en/_templates/callable.rst b/docs/en/_templates/callable.rst
new file mode 100644
index 0000000000000000000000000000000000000000..3a7b9d2b96c76dfa3eb1d8bef56f58f219fe7760
--- /dev/null
+++ b/docs/en/_templates/callable.rst
@@ -0,0 +1,14 @@
+.. role:: hidden
+ :class: hidden-section
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+ :members:
+ :special-members: __call__
+
+..
+ autogenerated from _templates/callable.rst
+ note it does not have :inherited-members:
diff --git a/docs/en/_templates/data_transform.rst b/docs/en/_templates/data_transform.rst
new file mode 100644
index 0000000000000000000000000000000000000000..376bfe9db6c305e681f265dd0e20b7b7ea6e500f
--- /dev/null
+++ b/docs/en/_templates/data_transform.rst
@@ -0,0 +1,13 @@
+.. role:: hidden
+ :class: hidden-section
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+ :members: transform
+
+..
+ autogenerated from _templates/callable.rst
+ note it does not have :inherited-members:
diff --git a/docs/en/advanced_guides/convention.md b/docs/en/advanced_guides/convention.md
new file mode 100644
index 0000000000000000000000000000000000000000..9edd04c1d5685aaa353e10d04e7a609d9fc9adf4
--- /dev/null
+++ b/docs/en/advanced_guides/convention.md
@@ -0,0 +1,120 @@
+# Convention in MMPretrain
+
+## Model Naming Convention
+
+We follow the convention below to name models, and contributors are advised to follow the same style. A model name is divided into five parts: algorithm info, module info, pretrain info, training info and data info. Different parts are concatenated by underscores `'_'`, and words within the same part are concatenated by dashes `'-'`.
+
+```text
+{algorithm info}_{module info}_{pretrain info}_{training info}_{data info}
+```
+
+- `algorithm info` (optional): The main algorithm information; it includes the main training algorithm, like MAE, BEiT, etc.
+- `module info`: The module information; it usually includes the backbone name, such as resnet, vit, etc.
+- `pretrain info` (optional): The pre-trained model information, e.g., the pre-trained model is trained on ImageNet-21k.
+- `training info`: The training information, i.e., the training schedule, including batch size, lr schedule, data augmentation and the like.
+- `data info`: The data information; it usually includes the dataset name, input size and so on, such as imagenet, cifar, etc.
+
+### Algorithm information
+
+The main algorithm name to train the model. For example:
+
+- `simclr`
+- `mocov2`
+- `eva-mae-style`
+
+Models trained by supervised image classification can omit this field.
+
+### Module information
+
+The modules of the model, usually, the backbone must be included in this field, and the neck and head
+information can be omitted. For example:
+
+- `resnet50`
+- `vit-base-p16`
+- `swin-base`
+
+### Pretrain information
+
+If the model is a fine-tuned model from a pre-trained model, we need to record some information of the
+pre-trained model. For example:
+
+- The source of the pre-trained model: `fb`, `openai`, etc.
+- The method to train the pre-trained model: `clip`, `mae`, `distill`, etc.
+- The dataset used for pre-training: `in21k`, `laion2b`, etc. (`in1k` can be omitted.)
+- The training duration: `300e`, `1600e`, etc.
+
+Not all information is necessary, only select the necessary information to distinguish different pre-trained
+models.
+
+At the end of this field, use a `-pre` as an identifier, like `mae-in21k-pre`.
+
+### Training information
+
+Training schedule, including the training type, `batch size`, `lr schedule`, data augmentation, special loss functions and so on:
+
+- format `{gpu x batch_per_gpu}`, such as `8xb32`
+
+Training type (mainly seen in transformer networks, such as the `ViT` algorithm, which is usually divided into two training types: pre-training and fine-tuning):
+
+- `ft` : configuration file for fine-tuning
+- `pt` : configuration file for pretraining
+
+Training recipe. Usually, only the part that is different from the original paper will be marked. These methods will be arranged in the order `{pipeline aug}-{train aug}-{loss trick}-{scheduler}-{epochs}`.
+
+- `coslr-200e` : use cosine scheduler to train 200 epochs
+- `autoaug-mixup-lbs-coslr-50e` : use `autoaug`, `mixup`, `label smooth`, `cosine scheduler` to train 50 epochs
+
+If the model is converted from a third-party repository, such as the official repository, the training
+information can be omitted and `3rdparty` is used as an identifier.
+
+### Data information
+
+- `in1k` : `ImageNet1k` dataset, defaults to an input image size of 224x224;
+- `in21k` : `ImageNet21k` dataset, also called `ImageNet22k` dataset, defaults to an input image size of 224x224;
+- `in1k-384px` : Indicates that the input image size is 384x384;
+- `cifar100`
+
+### Model Name Example
+
+```text
+vit-base-p32_clip-openai-pre_3rdparty_in1k
+```
+
+- `vit-base-p32`: The module information
+- `clip-openai-pre`: The pre-train information.
+ - `clip`: The pre-train method is clip.
+  - `openai`: The pre-trained model comes from OpenAI.
+ - `pre`: The pre-train identifier.
+- `3rdparty`: The model is converted from a third-party repository.
+- `in1k`: Dataset information. The model is trained from ImageNet-1k dataset and the input size is `224x224`.
+
+```text
+beit_beit-base-p16_8xb256-amp-coslr-300e_in1k
+```
+
+- `beit`: The algorithm information
+- `beit-base`: The module information, since the backbone is a modified ViT from BEiT, the backbone name is
+ also `beit`.
+- `8xb256-amp-coslr-300e`: The training information.
+ - `8xb256`: Use 8 GPUs and the batch size on each GPU is 256.
+ - `amp`: Use automatic-mixed-precision training.
+ - `coslr`: Use cosine annealing learning rate scheduler.
+ - `300e`: To train 300 epochs.
+- `in1k`: Dataset information. The model is trained from ImageNet-1k dataset and the input size is `224x224`.
+
+## Config File Naming Convention
+
+The naming of a config file is almost the same as the model name, with several differences (see the example after the list):
+
+- The training information is necessary, and cannot be `3rdparty`.
+- If the config file only includes backbone settings, without head settings or dataset settings, we will name
+  it as `{module info}_headless.py`. This kind of config file is usually used for third-party pre-trained
+  models on large datasets.
+
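+For example, the config file corresponding to the second model name example above would be:
+
+```text
+beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+```
+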
+## Checkpoint Naming Convention
+
+The naming of the checkpoint file mainly includes the model name, date and hash value.
+
+```text
+{model_name}_{date}-{hash}.pth
+```
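+
+For example, an illustrative checkpoint file name following this convention:
+
+```text
+resnet50_8xb32_in1k_20210831-ea4938fc.pth
+```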
diff --git a/docs/en/advanced_guides/datasets.md b/docs/en/advanced_guides/datasets.md
new file mode 100644
index 0000000000000000000000000000000000000000..1a018e441a1a1e820b02602dec0f85f553ec8eb0
--- /dev/null
+++ b/docs/en/advanced_guides/datasets.md
@@ -0,0 +1,72 @@
+# Adding New Dataset
+
+You can write a new dataset class that inherits from `BaseDataset`, and overwrite `load_data_list(self)`,
+like [CIFAR10](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/cifar.py) and [ImageNet](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/imagenet.py).
+Typically, this function returns a list, where each sample is a dict, containing necessary data information, e.g., `img` and `gt_label`.
+
+Assume we are going to implement a `Filelist` dataset, which takes filelists for both training and testing. The format of the annotation list is as follows:
+
+```text
+000001.jpg 0
+000002.jpg 1
+```
+
+## 1. Create Dataset Class
+
+We can create a new dataset in `mmpretrain/datasets/filelist.py` to load the data.
+
+```python
+import os.path as osp
+
+from mmpretrain.registry import DATASETS
+from .base_dataset import BaseDataset
+
+
+@DATASETS.register_module()
+class Filelist(BaseDataset):
+
+    def load_data_list(self):
+        assert isinstance(self.ann_file, str)
+
+        data_list = []
+        with open(self.ann_file) as f:
+            samples = [x.strip().split(' ') for x in f.readlines()]
+            for filename, gt_label in samples:
+                # join the image prefix (if any) with the file name
+                img_path = osp.join(self.img_prefix, filename)
+                info = {'img_path': img_path, 'gt_label': int(gt_label)}
+                data_list.append(info)
+        return data_list
+```
+
+## 2. Add to the package
+
+And add this dataset class in `mmpretrain/datasets/__init__.py`
+
+```python
+from .base_dataset import BaseDataset
+...
+from .filelist import Filelist
+
+__all__ = [
+ 'BaseDataset', ... ,'Filelist'
+]
+```
+
+## 3. Modify Related Config
+
+Then, to use `Filelist` in the config, you can modify it as follows:
+
+```python
+train_dataloader = dict(
+ ...
+ dataset=dict(
+ type='Filelist',
+ ann_file='image_list.txt',
+ pipeline=train_pipeline,
+ )
+)
+```
+
+All dataset classes that inherit from [`BaseDataset`](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/base_dataset.py) have **lazy loading** and **memory saving** features; you can refer to the related documents of {external+mmengine:doc}`BaseDataset `.
+
+```{note}
+If the dictionary of the data sample contains 'img_path' but not 'img', then the `LoadImageFromFile` transform must be added to the pipeline.
+```
diff --git a/docs/en/advanced_guides/evaluation.md b/docs/en/advanced_guides/evaluation.md
new file mode 100644
index 0000000000000000000000000000000000000000..d7978eafe02bfd09d1003bd5e1a6516a3b7020d6
--- /dev/null
+++ b/docs/en/advanced_guides/evaluation.md
@@ -0,0 +1,103 @@
+# Customize Evaluation Metrics
+
+## Use metrics in MMPretrain
+
+In MMPretrain, we have provided multiple metrics for both single-label classification and multi-label
+classification:
+
+**Single-label Classification**:
+
+- [`Accuracy`](mmpretrain.evaluation.Accuracy)
+- [`SingleLabelMetric`](mmpretrain.evaluation.SingleLabelMetric), including precision, recall, f1-score and
+ support.
+
+**Multi-label Classification**:
+
+- [`AveragePrecision`](mmpretrain.evaluation.AveragePrecision), or AP (mAP).
+- [`MultiLabelMetric`](mmpretrain.evaluation.MultiLabelMetric), including precision, recall, f1-score and
+ support.
+
+To use these metrics during validation and testing, we need to modify the `val_evaluator` and `test_evaluator`
+fields in the config file.
+
+Here are several examples:
+
+1. Calculate top-1 and top-5 accuracy during both validation and test.
+
+ ```python
+ val_evaluator = dict(type='Accuracy', topk=(1, 5))
+ test_evaluator = val_evaluator
+ ```
+
+2. Calculate top-1 accuracy, top-5 accuracy, precision and recall during both validation and test.
+
+ ```python
+ val_evaluator = [
+ dict(type='Accuracy', topk=(1, 5)),
+ dict(type='SingleLabelMetric', items=['precision', 'recall']),
+ ]
+ test_evaluator = val_evaluator
+ ```
+
+3. Calculate mAP (mean AveragePrecision), CP (Class-wise mean Precision), CR (Class-wise mean Recall), CF
+ (Class-wise mean F1-score), OP (Overall mean Precision), OR (Overall mean Recall) and OF1 (Overall mean
+ F1-score).
+
+ ```python
+ val_evaluator = [
+ dict(type='AveragePrecision'),
+ dict(type='MultiLabelMetric', average='macro'), # class-wise mean
+ dict(type='MultiLabelMetric', average='micro'), # overall mean
+ ]
+ test_evaluator = val_evaluator
+ ```
+
+## Add new metrics
+
+MMPretrain supports implementing customized evaluation metrics for users who need more flexibility.
+
+You need to create a new file under `mmpretrain/evaluation/metrics`, and implement the new metric in the file, for example, in `mmpretrain/evaluation/metrics/my_metric.py`. In this file, create a customized evaluation metric class `MyMetric` which inherits [`BaseMetric` in MMEngine](mmengine.evaluator.BaseMetric).
+
+The data format processing method `process` and the metric calculation method `compute_metrics` need to be overwritten. Register the class in the `METRICS` registry so it can be used like any built-in evaluation metric.
+
+```python
+from typing import Dict, List, Sequence
+
+from mmengine.evaluator import BaseMetric
+
+from mmpretrain.registry import METRICS
+
+
+@METRICS.register_module()
+class MyMetric(BaseMetric):
+
+    def process(self, data_batch: Sequence[Dict], data_samples: Sequence[Dict]):
+        """The processed results should be stored in ``self.results``, which will
+        be used to compute the metrics when all batches have been processed.
+        ``data_batch`` stores the batch data from the dataloader,
+        and ``data_samples`` stores the batch outputs from the model.
+        """
+        ...
+
+    def compute_metrics(self, results: List) -> Dict:
+        """Compute the metrics from the processed results and return the
+        evaluation results."""
+        ...
+```
+
+Then, import it in the `mmpretrain/evaluation/metrics/__init__.py` to add it into the `mmpretrain.evaluation` package.
+
+```python
+# In mmpretrain/evaluation/metrics/__init__.py
+...
+from .my_metric import MyMetric
+
+__all__ = [..., 'MyMetric']
+```
+
+Finally, use `MyMetric` in the `val_evaluator` and `test_evaluator` field of config files.
+
+```python
+val_evaluator = dict(type='MyMetric', ...)
+test_evaluator = val_evaluator
+```
+
+```{note}
+More details can be found in {external+mmengine:doc}`MMEngine Documentation: Evaluation `.
+```
diff --git a/docs/en/advanced_guides/modules.md b/docs/en/advanced_guides/modules.md
new file mode 100644
index 0000000000000000000000000000000000000000..fb34aedec2c7f2940504f307351f80305f1ee441
--- /dev/null
+++ b/docs/en/advanced_guides/modules.md
@@ -0,0 +1,511 @@
+# Customize Models
+
+In our design, a complete model is defined as a top-level module which contains several model components based on their functionalities.
+
+- model: a top-level module that defines the type of the task, such as `ImageClassifier` for image classification, `MAE` for self-supervised learning, `ImageToImageRetriever` for image retrieval.
+- backbone: usually a feature extraction network that records the major differences between models, e.g., `ResNet`, `MobileNet`.
+- neck: the component between backbone and head, e.g., `GlobalAveragePooling`.
+- head: the component for specific tasks, e.g., `ClsHead`, `ContrastiveHead`.
+- loss: the component in the head for calculating losses, e.g., `CrossEntropyLoss`, `LabelSmoothLoss`.
+- target_generator: the component specific to self-supervised learning tasks, e.g., `VQKD`, `HOGGenerator`.
+
+## Add a new model
+
+Generally, for image classification and retrieval tasks, the pipelines are consistent. However, the pipelines differ between self-supervised learning algorithms, like `MAE` and `BEiT`. Thus, in this section, we will explain how to add your own self-supervised learning algorithm.
+
+### Add a new self-supervised learning algorithm
+
+1. Create a new file `mmpretrain/models/selfsup/new_algorithm.py` and implement `NewAlgorithm` in it.
+
+ ```python
+ from mmpretrain.registry import MODELS
+ from .base import BaseSelfSupvisor
+
+
+ @MODELS.register_module()
+ class NewAlgorithm(BaseSelfSupvisor):
+
+ def __init__(self, backbone, neck=None, head=None, init_cfg=None):
+ super().__init__(init_cfg)
+ pass
+
+ # ``extract_feat`` function is defined in BaseSelfSupvisor, you could
+ # overwrite it if needed
+ def extract_feat(self, inputs, **kwargs):
+ pass
+
+ # the core function to compute the loss
+ def loss(self, inputs, data_samples, **kwargs):
+ pass
+
+ ```
+
+2. Import the new algorithm module in `mmpretrain/models/selfsup/__init__.py`
+
+ ```python
+ ...
+ from .new_algorithm import NewAlgorithm
+
+ __all__ = [
+ ...,
+ 'NewAlgorithm',
+ ...
+ ]
+ ```
+
+3. Use it in your config file.
+
+ ```python
+ model = dict(
+ type='NewAlgorithm',
+ backbone=...,
+ neck=...,
+ head=...,
+ ...
+ )
+ ```
+
+## Add a new backbone
+
+Here we present how to develop a new backbone component by an example of `ResNet_CIFAR`.
+As the input size of CIFAR is 32x32, which is much smaller than the default size of 224x224 in ImageNet, this backbone replaces the `kernel_size=7, stride=2` stem convolution with `kernel_size=3, stride=1` and removes the max pooling after the stem layer to avoid forwarding small feature maps to residual blocks.
+
+The easiest way is to inherit from `ResNet` and only modify the stem layer.
+
+1. Create a new file `mmpretrain/models/backbones/resnet_cifar.py`.
+
+ ```python
+   import torch.nn as nn
+   from mmcv.cnn import build_conv_layer, build_norm_layer
+
+   from mmpretrain.registry import MODELS
+   from .resnet import ResNet
+
+
+ @MODELS.register_module()
+ class ResNet_CIFAR(ResNet):
+
+ """ResNet backbone for CIFAR.
+
+ short description of the backbone
+
+ Args:
+ depth(int): Network depth, from {18, 34, 50, 101, 152}.
+ ...
+ """
+
+       def __init__(self, depth, deep_stem=False, **kwargs):
+ # call ResNet init
+ super(ResNet_CIFAR, self).__init__(depth, deep_stem=deep_stem, **kwargs)
+ # other specific initializations
+ assert not self.deep_stem, 'ResNet_CIFAR do not support deep_stem'
+
+ def _make_stem_layer(self, in_channels, base_channels):
+ # override the ResNet method to modify the network structure
+ self.conv1 = build_conv_layer(
+ self.conv_cfg,
+ in_channels,
+ base_channels,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ bias=False)
+ self.norm1_name, norm1 = build_norm_layer(
+ self.norm_cfg, base_channels, postfix=1)
+ self.add_module(self.norm1_name, norm1)
+ self.relu = nn.ReLU(inplace=True)
+
+ def forward(self, x):
+ # Customize the forward method if needed.
+ x = self.conv1(x)
+ x = self.norm1(x)
+ x = self.relu(x)
+ outs = []
+ for i, layer_name in enumerate(self.res_layers):
+ res_layer = getattr(self, layer_name)
+ x = res_layer(x)
+ if i in self.out_indices:
+ outs.append(x)
+ # The return value needs to be a tuple with multi-scale outputs from different depths.
+ # If you don't need multi-scale features, just wrap the output as a one-item tuple.
+ return tuple(outs)
+
+ def init_weights(self):
+ # Customize the weight initialization method if needed.
+ super().init_weights()
+
+ # Disable the weight initialization if loading a pretrained model.
+ if self.init_cfg is not None and self.init_cfg['type'] == 'Pretrained':
+ return
+
+ # Usually, we recommend using `init_cfg` to specify weight initialization methods
+ # of convolution, linear, or normalization layers. If you have some special needs,
+ # do these extra weight initialization here.
+ ...
+ ```
+
+```{note}
+In the OpenMMLab 2.0 design, the original `BACKBONES`, `NECKS`, `HEADS` and `LOSSES` registries are replaced by the unified `MODELS` registry.
+```
+
+2. Import the new backbone module in `mmpretrain/models/backbones/__init__.py`.
+
+ ```python
+ ...
+ from .resnet_cifar import ResNet_CIFAR
+
+ __all__ = [
+ ..., 'ResNet_CIFAR'
+ ]
+ ```
+
+3. Modify the correlated settings in your config file.
+
+ ```python
+ model = dict(
+ ...
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=18,
+ ...),
+ ...
+ ```
+
+### Add a new backbone for self-supervised learning
+
+For some self-supervised learning algorithms, such as `MAE` and `BEiT`, the backbones are somewhat different: they need to deal with a `mask` in order to extract features only from the visible tokens.
+
+Take [MAEViT](mmpretrain.models.selfsup.MAEViT) as an example: we need to overwrite the `forward` function to compute with the `mask`. We also define `init_weights` to initialize parameters and `random_masking` to generate the mask for `MAE` pre-training.
+
+```python
+from typing import Optional, Tuple
+
+import torch
+
+from mmpretrain.models import VisionTransformer
+
+
+class MAEViT(VisionTransformer):
+    """Vision Transformer for MAE pre-training."""
+
+    def __init__(self, mask_ratio: float, **kwargs) -> None:
+ super().__init__(**kwargs)
+ # position embedding is not learnable during pretraining
+ self.pos_embed.requires_grad = False
+ self.mask_ratio = mask_ratio
+ self.num_patches = self.patch_resolution[0] * self.patch_resolution[1]
+
+    def init_weights(self) -> None:
+        """Initialize position embedding, patch embedding and cls token."""
+        super().init_weights()
+        # add extra initialization here if needed
+        pass
+
+ def random_masking(
+ self,
+ x: torch.Tensor,
+ mask_ratio: float = 0.75
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+ """Generate the mask for MAE Pre-training."""
+ pass
+
+ def forward(
+ self,
+ x: torch.Tensor,
+ mask: Optional[bool] = True
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+ """Generate features for masked images.
+
+        The function supports two kinds of forward behaviors. If the ``mask`` is
+        ``True``, the function will randomly mask some patches and get the
+        hidden features of the visible patches, which means the function will
+        be executed as masked image modeling pre-training;
+        if the ``mask`` is ``None`` or ``False``, the forward function will
+        call ``super().forward()``, which extracts features from images without
+        a mask.
+        """
+        if mask is None or mask is False:
+            return super().forward(x)
+
+ else:
+ B = x.shape[0]
+ x = self.patch_embed(x)[0]
+ # add pos embed w/o cls token
+ x = x + self.pos_embed[:, 1:, :]
+
+ # masking: length -> length * mask_ratio
+ x, mask, ids_restore = self.random_masking(x, self.mask_ratio)
+
+ # append cls token
+ cls_token = self.cls_token + self.pos_embed[:, :1, :]
+ cls_tokens = cls_token.expand(B, -1, -1)
+ x = torch.cat((cls_tokens, x), dim=1)
+
+ for _, layer in enumerate(self.layers):
+ x = layer(x)
+ # Use final norm
+ x = self.norm1(x)
+
+ return (x, mask, ids_restore)
+
+```
+
+## Add a new neck
+
+Here we take `GlobalAveragePooling` as an example. It is a very simple neck without any arguments.
+To add a new neck, we mainly implement the `forward` function, which applies some operations on the output from the backbone and forwards the results to the head.
+
+1. Create a new file in `mmpretrain/models/necks/gap.py`.
+
+ ```python
+ import torch.nn as nn
+
+ from mmpretrain.registry import MODELS
+
+ @MODELS.register_module()
+ class GlobalAveragePooling(nn.Module):
+
+       def __init__(self):
+           super().__init__()
+           self.gap = nn.AdaptiveAvgPool2d((1, 1))
+
+ def forward(self, inputs):
+ # we regard inputs as tensor for simplicity
+ outs = self.gap(inputs)
+ outs = outs.view(inputs.size(0), -1)
+ return outs
+ ```
+
+2. Import the new neck module in `mmpretrain/models/necks/__init__.py`.
+
+ ```python
+ ...
+ from .gap import GlobalAveragePooling
+
+ __all__ = [
+ ..., 'GlobalAveragePooling'
+ ]
+ ```
+
+3. Modify the correlated settings in your config file.
+
+ ```python
+ model = dict(
+ neck=dict(type='GlobalAveragePooling'),
+ )
+ ```
+
+## Add a new head
+
+### Based on ClsHead
+
+Here we present how to develop a new head with a simplified `VisionTransformerClsHead` as the example.
+To implement a new head, we need to implement a `pre_logits` method for the processing before the final classification layer, and a `forward` method.
+
+:::{admonition} Why do we need the `pre_logits` method?
+:class: note
+
+In classification tasks, we usually use a linear layer to do the final classification. And sometimes, we need
+to obtain the feature before the final classification, which is the output of the `pre_logits` method.
+:::
+
+1. Create a new file in `mmpretrain/models/heads/vit_head.py`.
+
+ ```python
+ import torch.nn as nn
+
+ from mmpretrain.registry import MODELS
+ from .cls_head import ClsHead
+
+
+ @MODELS.register_module()
+ class VisionTransformerClsHead(ClsHead):
+
+ def __init__(self, num_classes, in_channels, hidden_dim, **kwargs):
+ super().__init__(**kwargs)
+ self.in_channels = in_channels
+ self.num_classes = num_classes
+ self.hidden_dim = hidden_dim
+
+ self.fc1 = nn.Linear(in_channels, hidden_dim)
+ self.act = nn.Tanh()
+ self.fc2 = nn.Linear(hidden_dim, num_classes)
+
+ def pre_logits(self, feats):
+ # The output of the backbone is usually a tuple from multiple depths,
+ # and for classification, we only need the final output.
+ feat = feats[-1]
+
+ # The final output of VisionTransformer is a tuple of patch tokens and
+ # classification tokens. We need classification tokens here.
+ _, cls_token = feat
+
+ # Do all works except the final classification linear layer.
+ return self.act(self.fc1(cls_token))
+
+ def forward(self, feats):
+ pre_logits = self.pre_logits(feats)
+
+ # The final classification linear layer.
+ cls_score = self.fc2(pre_logits)
+ return cls_score
+ ```
+
+2. Import the module in `mmpretrain/models/heads/__init__.py`.
+
+ ```python
+ ...
+ from .vit_head import VisionTransformerClsHead
+
+ __all__ = [
+ ..., 'VisionTransformerClsHead'
+ ]
+ ```
+
+3. Modify the correlated settings in your config file.
+
+ ```python
+ model = dict(
+ head=dict(
+ type='VisionTransformerClsHead',
+ ...,
+ ))
+ ```
+
+### Based on BaseModule
+
+Here is an example of `MAEPretrainHead`, which is based on `BaseModule` and implemented for the masked image modeling task. It is required to implement the `loss` function to compute the loss, while the other helper functions are optional.
+
+```python
+# Copyright (c) OpenMMLab. All rights reserved.
+import torch
+from mmengine.model import BaseModule
+
+from mmpretrain.registry import MODELS
+
+
+@MODELS.register_module()
+class MAEPretrainHead(BaseModule):
+ """Head for MAE Pre-training."""
+
+ def __init__(self,
+ loss: dict,
+ norm_pix: bool = False,
+ patch_size: int = 16) -> None:
+ super().__init__()
+ self.norm_pix = norm_pix
+ self.patch_size = patch_size
+ self.loss_module = MODELS.build(loss)
+
+ def patchify(self, imgs: torch.Tensor) -> torch.Tensor:
+ """Split images into non-overlapped patches."""
+ p = self.patch_size
+ assert imgs.shape[2] == imgs.shape[3] and imgs.shape[2] % p == 0
+
+ h = w = imgs.shape[2] // p
+ x = imgs.reshape(shape=(imgs.shape[0], 3, h, p, w, p))
+ x = torch.einsum('nchpwq->nhwpqc', x)
+ x = x.reshape(shape=(imgs.shape[0], h * w, p**2 * 3))
+ return x
+
+ def construct_target(self, target: torch.Tensor) -> torch.Tensor:
+ """Construct the reconstruction target."""
+ target = self.patchify(target)
+ if self.norm_pix:
+ # normalize the target image
+ mean = target.mean(dim=-1, keepdim=True)
+ var = target.var(dim=-1, keepdim=True)
+ target = (target - mean) / (var + 1.e-6)**.5
+
+ return target
+
+ def loss(self, pred: torch.Tensor, target: torch.Tensor,
+ mask: torch.Tensor) -> torch.Tensor:
+ """Generate loss."""
+ target = self.construct_target(target)
+ loss = self.loss_module(pred, target, mask)
+
+ return loss
+```
+
+After the implementation, the remaining steps are the same as step 2 and step 3 in [Based on ClsHead](#based-on-clshead).
+
+## Add a new loss
+
+To add a new loss function, we mainly implement the `forward` function in the loss module. We should also register the loss module in `MODELS`.
+In addition, it is helpful to leverage the decorator `weighted_loss` to weight the loss for each element.
+Assuming that we want to mimic a probability distribution generated by another classification model, we implement an `L1Loss` to fulfill the purpose as below.
+
+1. Create a new file in `mmpretrain/models/losses/l1_loss.py`.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ from mmpretrain.registry import MODELS
+ from .utils import weighted_loss
+
+ @weighted_loss
+ def l1_loss(pred, target):
+ assert pred.size() == target.size() and target.numel() > 0
+ loss = torch.abs(pred - target)
+ return loss
+
+ @MODELS.register_module()
+ class L1Loss(nn.Module):
+
+ def __init__(self, reduction='mean', loss_weight=1.0):
+ super(L1Loss, self).__init__()
+ self.reduction = reduction
+ self.loss_weight = loss_weight
+
+ def forward(self,
+ pred,
+ target,
+ weight=None,
+ avg_factor=None,
+ reduction_override=None):
+ assert reduction_override in (None, 'none', 'mean', 'sum')
+ reduction = (
+ reduction_override if reduction_override else self.reduction)
+ loss = self.loss_weight * l1_loss(
+ pred, target, weight, reduction=reduction, avg_factor=avg_factor)
+ return loss
+ ```
+
+2. Import the module in `mmpretrain/models/losses/__init__.py`.
+
+ ```python
+ ...
+ from .l1_loss import L1Loss
+
+ __all__ = [
+ ..., 'L1Loss'
+ ]
+ ```
+
+3. Modify loss field in the head configs.
+
+ ```python
+ model = dict(
+ head=dict(
+ loss=dict(type='L1Loss', loss_weight=1.0),
+ ))
+ ```
+
+Finally, we can combine all the new model components in a config file to create a new model. Because `ResNet_CIFAR` is not a ViT-based backbone, we do not use `VisionTransformerClsHead` here.
+
+```python
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=18,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=512,
+ loss=dict(type='L1Loss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
+
+```
+
+```{tip}
+For convenience, the same model components can be inherited from existing config files; refer to [Learn about configs](../user_guides/config.md) for more details.
+```
diff --git a/docs/en/advanced_guides/pipeline.md b/docs/en/advanced_guides/pipeline.md
new file mode 100644
index 0000000000000000000000000000000000000000..058e8139c91b331762cee7090d0626004e645930
--- /dev/null
+++ b/docs/en/advanced_guides/pipeline.md
@@ -0,0 +1,170 @@
+# Customize Data Pipeline
+
+## Design of Data pipelines
+
+In the [new dataset tutorial](./datasets.md), we learned that the dataset class uses the `load_data_list` method
+to initialize the entire dataset, and the information of every sample is saved in a dict.
+
+Usually, to save memory, we only load image paths and labels in `load_data_list`, and load the full
+image content when we use it. Moreover, we may want to apply some random data augmentation when picking
+samples during training. Almost all data loading, pre-processing, and formatting operations can be configured in
+MMPretrain by the **data pipeline**.
+
+The data pipeline defines how to process the sample dict when indexing a sample from the dataset. It consists
+of a sequence of data transforms. Each data transform takes a dict as input, processes it, and outputs a
+dict for the next data transform.
+
+Here is a data pipeline example for ResNet-50 training on ImageNet.
+
+```python
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+```
+
+All available data transforms in MMPretrain can be found in the [data transforms docs](mmpretrain.datasets.transforms).
+
+## Modify the training/test pipeline
+
+The data pipeline in MMPretrain is pretty flexible. You can control almost every step of the data
+preprocessing from the config file, but on the other hand, you may be confused facing so many options.
+
+Here is a common practice and guidance for image classification tasks.
+
+### Loading
+
+At the beginning of a data pipeline, we usually need to load image data from the file path.
+[`LoadImageFromFile`](mmcv.transforms.LoadImageFromFile) is commonly used to do this task.
+
+```python
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ ...
+]
+```
+
+If you want to load data from files with special formats or special locations, you can [implement a new loading
+transform](#add-new-data-transforms) and add it at the beginning of the data pipeline.
+
+### Augmentation and other processing
+
+During training, we usually need to do data augmentation to avoid overfitting. During the test, we also need to do
+some data processing like resizing and cropping. These data transforms will be placed after the loading process.
+
+Here is a simple data augmentation recipe example. It randomly resizes and crops the input image to the
+specified scale, and randomly flips the image horizontally with a probability of 0.5.
+
+```python
+train_pipeline = [
+ ...
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ ...
+]
+```
+
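+During the test, a typical pipeline replaces the random transforms with deterministic resizing and cropping. Here is a minimal sketch of a common ResNet-style test recipe (the exact sizes are illustrative):
+
+```python
+test_pipeline = [
+    dict(type='LoadImageFromFile'),
+    dict(type='ResizeEdge', scale=256, edge='short'),
+    dict(type='CenterCrop', crop_size=224),
+    dict(type='PackInputs'),
+]
+```
+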
+Here is a heavy data augmentation recipe example used in [Swin-Transformer](../papers/swin_transformer.md)
+training. To align with the official implementation, it specifies `pillow` as the resize backend and `bicubic`
+as the resize algorithm. Moreover, it adds [`RandAugment`](mmpretrain.datasets.transforms.RandAugment) and
+[`RandomErasing`](mmpretrain.datasets.transforms.RandomErasing) as extra data augmentation methods.
+
+This configuration specifies every detail of the data augmentation, and you can simply copy it to your own
+config file to apply the Swin-Transformer data augmentations.
+
+```python
+bgr_mean = [103.53, 116.28, 123.675]
+bgr_std = [57.375, 57.12, 58.395]
+
+train_pipeline = [
+ ...
+ dict(type='RandomResizedCrop', scale=224, backend='pillow', interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ ...
+]
+```
+
+```{note}
+Usually, the data augmentation part in the data pipeline handles only image-wise transforms, not transforms
+like image normalization or mixup/cutmix. This is because image normalization and mixup/cutmix can be applied to
+batched data to accelerate processing. To configure image normalization and mixup/cutmix, please use the [data preprocessor](mmpretrain.models.utils.data_preprocessor).
+```
+
+### Formatting
+
+The formatting step collects training data from the data information dict and converts these data into a
+model-friendly format.
+
+In most cases, you can simply use [`PackInputs`](mmpretrain.datasets.transforms.PackInputs), and it will
+convert the image from NumPy array format to a PyTorch tensor, and pack the ground-truth category information and
+other meta information into a [`DataSample`](mmpretrain.structures.DataSample).
+
+```python
+train_pipeline = [
+ ...
+ dict(type='PackInputs'),
+]
+```
+
+## Add new data transforms
+
+1. Write a new data transform in any file, e.g., `my_transform.py`, and place it in
+ the folder `mmpretrain/datasets/transforms/`. The data transform class needs to inherit
+ the [`mmcv.transforms.BaseTransform`](mmcv.transforms.BaseTransform) class and override
+ the `transform` method which takes a dict as input and returns a dict.
+
+ ```python
+ from mmcv.transforms import BaseTransform
+ from mmpretrain.registry import TRANSFORMS
+
+ @TRANSFORMS.register_module()
+ class MyTransform(BaseTransform):
+
+ def transform(self, results):
+ # Modify the data information dict `results`.
+ return results
+ ```
+
+2. Import the new class in the `mmpretrain/datasets/transforms/__init__.py`.
+
+ ```python
+ ...
+ from .my_transform import MyTransform
+
+ __all__ = [
+ ..., 'MyTransform'
+ ]
+ ```
+
+3. Use it in config files.
+
+ ```python
+ train_pipeline = [
+ ...
+ dict(type='MyTransform'),
+ ...
+ ]
+ ```
+
+## Pipeline visualization
+
+After designing data pipelines, you can use the [visualization tools](../useful_tools/dataset_visualization.md) to check the transformed images.
diff --git a/docs/en/advanced_guides/runtime.md b/docs/en/advanced_guides/runtime.md
new file mode 100644
index 0000000000000000000000000000000000000000..8150fb1432eaeb54553da93b943978eb953925fe
--- /dev/null
+++ b/docs/en/advanced_guides/runtime.md
@@ -0,0 +1,221 @@
+# Customize Runtime Settings
+
+The runtime configurations include many helpful functionalities, like checkpoint saving, logger configuration,
+etc. In this tutorial, we will introduce how to configure these functionalities.
+
+## Save Checkpoint
+
+The checkpoint saving functionality is a default hook during training. And you can configure it in the
+`default_hooks.checkpoint` field.
+
+```{note}
+The hook mechanism is widely used in all OpenMMLab libraries. Through hooks, you can plug in many
+functionalities without modifying the main execution logic of the runner.
+
+A detailed introduction of hooks can be found in {external+mmengine:doc}`Hooks `.
+```
+
+**The default settings**
+
+```python
+default_hooks = dict(
+ ...
+ checkpoint = dict(type='CheckpointHook', interval=1)
+ ...
+)
+```
+
+Here are some commonly used arguments; all available arguments can be found in [CheckpointHook](mmengine.hooks.CheckpointHook).
+
+- **`interval`** (int): The saving period. If set to -1, it will never save checkpoints.
+- **`by_epoch`** (bool): Whether the **`interval`** is by epoch or by iteration. Defaults to `True`.
+- **`out_dir`** (str): The root directory to save checkpoints. If not specified, the checkpoints will be saved in the work directory. If specified, the checkpoints will be saved in the sub-folder of the **`out_dir`**.
+- **`max_keep_ckpts`** (int): The maximum checkpoints to keep. In some cases, we want only the latest few checkpoints and would like to delete old ones to save disk space. Defaults to -1, which means unlimited.
+- **`save_best`** (str, List[str]): If specified, it will save the checkpoint with the best evaluation result.
+ Usually, you can simply use `save_best="auto"` to automatically select the evaluation metric.
+
+If you want a more advanced configuration, please refer to the [CheckpointHook docs](tutorials/hook.md#checkpointhook).
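+
+For example, a sketch combining several of these options (the values are illustrative):
+
+```python
+default_hooks = dict(
+    checkpoint=dict(
+        type='CheckpointHook',
+        interval=1,           # save a checkpoint every epoch
+        max_keep_ckpts=3,     # keep only the latest 3 checkpoints
+        save_best='auto'),    # also keep the best checkpoint
+)
+```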
+
+## Load Checkpoint / Resume Training
+
+In config files, you can specify the loading and resuming functionality as below:
+
+```python
+# load from which checkpoint
+load_from = "Your checkpoint path"
+
+# whether to resume training from the loaded checkpoint
+resume = False
+```
+
+The `load_from` field can be either a local path or an HTTP path. You can resume training from the checkpoint by
+specifying `resume=True`.
+
+```{tip}
+You can also enable auto resuming from the latest checkpoint by specifying `load_from=None` and `resume=True`.
+Runner will find the latest checkpoint from the work directory automatically.
+```
+
+If you are training models by our `tools/train.py` script, you can also use `--resume` argument to resume
+training without modifying the config file manually.
+
+```bash
+# Automatically resume from the latest checkpoint.
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume
+
+# Resume from the specified checkpoint.
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume checkpoints/resnet.pth
+```
+
+## Randomness Configuration
+
+In the `randomness` field, we provide some options to make the experiment as reproducible as possible.
+
+By default, we don't specify a seed in the config file, and in every experiment, the program will generate a random seed.
+
+**Default settings:**
+
+```python
+randomness = dict(seed=None, deterministic=False)
+```
+
+To make the experiment more reproducible, you can specify a seed and set `deterministic=True`. The influence
+of the `deterministic` option can be found [here](https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking).
+
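+For example, a fully reproducible setting could look like this (the seed value is arbitrary):
+
+```python
+randomness = dict(seed=0, deterministic=True)
+```
+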
+## Log Configuration
+
+The log configuration relates to multiple fields.
+
+In the `log_level` field, you can specify the global logging level. See {external+python:ref}`Logging Levels` for a list of levels.
+
+```python
+log_level = 'INFO'
+```
+
+In the `default_hooks.logger` field, you can specify the logging interval during training and testing. And all
+available arguments can be found in the [LoggerHook docs](tutorials/hook.md#loggerhook).
+
+```python
+default_hooks = dict(
+ ...
+ # print log every 100 iterations.
+ logger=dict(type='LoggerHook', interval=100),
+ ...
+)
+```
+
+In the `log_processor` field, you can specify the log smooth method. Usually, we use a window with length of 10
+to smooth the log and output the mean value of all information. If you want to specify the smooth method of
+some information finely, see the {external+mmengine:doc}`LogProcessor docs `.
+
+```python
+# The default setting, which will smooth the values in training log by a 10-length window.
+log_processor = dict(window_size=10)
+```
+
+In the `visualizer` field, you can specify multiple backends to save the log information, such as TensorBoard
+and WandB. More details can be found in the [Visualizer section](#visualizer).
+
+## Custom Hooks
+
+Many of the above functionalities are implemented by hooks, and you can also plug in other custom hooks by modifying the
+`custom_hooks` field. Here are some hooks in MMEngine and MMPretrain that you can use directly, such as:
+
+- [EMAHook](mmpretrain.engine.hooks.EMAHook)
+- [SyncBuffersHook](mmengine.hooks.SyncBuffersHook)
+- [EmptyCacheHook](mmengine.hooks.EmptyCacheHook)
+- [ClassNumCheckHook](mmpretrain.engine.hooks.ClassNumCheckHook)
+- ......
+
+For example, EMA (Exponential Moving Average) is widely used in model training, and you can enable it as
+below:
+
+```python
+custom_hooks = [
+ dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL'),
+]
+```
+
+## Visualize Validation
+
+The validation visualization functionality is a default hook during validation. And you can configure it in the
+`default_hooks.visualization` field.
+
+By default, it is disabled, and you can enable it by specifying `enable=True`. More arguments can be found in
+the [VisualizationHook docs](mmpretrain.engine.hooks.VisualizationHook).
+
+```python
+default_hooks = dict(
+ ...
+ visualization=dict(type='VisualizationHook', enable=False),
+ ...
+)
+```
+
+This hook selects some images from the validation dataset and tags the prediction results on these images
+during every validation process. You can use it to watch how the model performance on actual images varies
+during training.
+
+In addition, if the images in your validation dataset are small (\<100 pixels), you can rescale them before
+visualization by specifying `rescale_factor=2.` or higher.
+
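+For example, a sketch that enables the hook and upscales small validation images:
+
+```python
+default_hooks = dict(
+    visualization=dict(type='VisualizationHook', enable=True, rescale_factor=2.),
+)
+```
+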
+## Visualizer
+
+The visualizer is used to record all kinds of information during training and test, including logs, images and
+scalars. By default, the recorded information will be saved at the `vis_data` folder under the work directory.
+
+**Default settings:**
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ ]
+)
+```
+
+Usually, the most useful function is to save the log and scalars like `loss` to different backends.
+For example, to save them to TensorBoard, simply set them as below:
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ dict(type='TensorboardVisBackend'),
+ ]
+)
+```
+
+Or save them to WandB as below:
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ dict(type='WandbVisBackend'),
+ ]
+)
+```
+
+## Environment Configuration
+
+In the `env_cfg` field, you can configure some low-level parameters, like cuDNN, multi-process, and distributed
+communication.
+
+**Please make sure you understand the meaning of these parameters before modifying them.**
+
+```python
+env_cfg = dict(
+ # whether to enable cudnn benchmark
+ cudnn_benchmark=False,
+
+ # set multi-process parameters
+ mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
+
+ # set distributed parameters
+ dist_cfg=dict(backend='nccl'),
+)
+```
diff --git a/docs/en/advanced_guides/schedule.md b/docs/en/advanced_guides/schedule.md
new file mode 100644
index 0000000000000000000000000000000000000000..f02075924d2a38de7c65c23e3377c793cec7ff4f
--- /dev/null
+++ b/docs/en/advanced_guides/schedule.md
@@ -0,0 +1,361 @@
+# Customize Training Schedule
+
+In our codebase, [default training schedules](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/schedules) have been provided for common datasets such as CIFAR, ImageNet, etc. If we attempt to experiment on these datasets for higher accuracy, or on new methods and datasets, we might need to modify the strategies.
+
+In this tutorial, we will introduce how to modify configs to construct optimizers, use parameter-wise configuration, gradient clipping and gradient accumulation, as well as customize learning rate and momentum schedules. Furthermore, we introduce a template to customize self-implemented optimization methods for the project.
+
+## Customize optimization
+
+We use the `optim_wrapper` field to configure the optimization strategies, including the choice of optimizer, automatic mixed precision training, parameter-wise configurations, gradient clipping and gradient accumulation. Details are given below.
+
+### Use optimizers supported by PyTorch
+
+We support all the optimizers implemented by PyTorch, and to use them, please change the `optimizer` field of config files.
+
+For example, if you want to use [`SGD`](torch.optim.SGD), the modification in the config file could be as follows. Notice that optimization-related settings should all be wrapped inside `optim_wrapper`.
+
+```python
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=0.0003, weight_decay=0.0001)
+)
+```
+
+```{note}
+The `type` in the optimizer config is not a constructor but an optimizer name in PyTorch.
+Refers to {external+torch:ref}`List of optimizers supported by PyTorch ` for more choices.
+```
+
+To modify the learning rate of the model, just modify the `lr` in the config of optimizer.
+You can also directly set other arguments according to the [API doc](torch.optim) of PyTorch.
+
+For example, if you want to use [`Adam`](torch.optim.Adam) with settings like `torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)` in PyTorch, you could use the config below:
+
+```python
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer = dict(
+ type='Adam',
+ lr=0.001,
+ betas=(0.9, 0.999),
+ eps=1e-08,
+ weight_decay=0,
+ amsgrad=False),
+)
+```
+
+````{note}
+The default type of the `optim_wrapper` field is [`OptimWrapper`](mmengine.optim.OptimWrapper); therefore, you can
+usually omit the `type` field, like:
+
+```python
+optim_wrapper = dict(
+ optimizer=dict(
+ type='Adam',
+ lr=0.001,
+ betas=(0.9, 0.999),
+ eps=1e-08,
+ weight_decay=0,
+ amsgrad=False))
+```
+````
+
+### Use AMP training
+
+If we want to use the automatic mixed precision training, we can simply change the type of `optim_wrapper` to `AmpOptimWrapper` in config files.
+
+```python
+optim_wrapper = dict(type='AmpOptimWrapper', optimizer=...)
+```
+
+Alternatively, for convenience, we can set the `--amp` parameter to turn on the AMP option directly in the `tools/train.py` script. Refer to the [Training tutorial](../user_guides/train.md) for details of starting a training.
+
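+For example, a command-line sketch (the config path is illustrative):
+
+```bash
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --amp
+```
+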
+### Parameter-wise finely configuration
+
+Some models may have parameter-specific optimization settings, for example, no weight decay for the BatchNorm layers, or different learning rates for different network layers.
+To configure these finely, we can use the `paramwise_cfg` argument in `optim_wrapper`.
+
+- **Set different hyper-parameter multipliers for different types of parameters.**
+
+ For instance, we can set `norm_decay_mult=0.` in `paramwise_cfg` to change the weight decay of weight and bias of normalization layers to zero.
+
+ ```python
+ optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, weight_decay=1e-4),
+ paramwise_cfg=dict(norm_decay_mult=0.))
+ ```
+
+  More types of parameters are supported, as listed below:
+
+  - `bias_lr_mult`: Multiplier for the learning rate of biases (excluding the biases of normalization layers and the offsets of deformable convolution layers). Defaults to 1.
+  - `bias_decay_mult`: Multiplier for the weight decay of biases (excluding the biases of normalization layers and the offsets of deformable convolution layers). Defaults to 1.
+ - `norm_decay_mult`: Multiplier for weight decay of weight and bias of normalization layers. Defaults to 1.
+ - `flat_decay_mult`: Multiplier for weight decay of all one-dimensional parameters. Defaults to 1.
+ - `dwconv_decay_mult`: Multiplier for weight decay of depth-wise convolution layers. Defaults to 1.
+ - `bypass_duplicate`: Whether to bypass duplicated parameters. Defaults to `False`.
+ - `dcn_offset_lr_mult`: Multiplier for learning rate of deformable convolution layers. Defaults to 1.
+
+- **Set different hyper-parameter multipliers for specific parameters.**
+
+  MMPretrain can use `custom_keys` in `paramwise_cfg` to specify different learning rates or weight decay for specific parameters.
+
+  For example, to set all learning rates and weight decays of `backbone.layer0` to 0, keep the rest of `backbone` the same as the optimizer, and set the learning rate of `head` to 0.001, use the config below.
+
+ ```python
+ optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'backbone.layer0': dict(lr_mult=0, decay_mult=0),
+ 'backbone': dict(lr_mult=1),
+ 'head': dict(lr_mult=0.1)
+ }))
+ ```
+
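+As promised above, here is a combined sketch that uses both styles of `paramwise_cfg` together. The multiplier values are arbitrary and only for illustration, not recommended settings:
+
+```python
+optim_wrapper = dict(
+    optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=1e-4),
+    paramwise_cfg=dict(
+        # no weight decay for normalization layers and other 1-d parameters
+        norm_decay_mult=0.,
+        flat_decay_mult=0.,
+        # parameters matching a custom key use these multipliers instead
+        custom_keys={
+            'backbone.layer0': dict(lr_mult=0.1),
+            'head': dict(lr_mult=10.),
+        }))
+```
+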
+### Gradient clipping
+
+During training, the loss function may approach a cliff-like region of the loss surface and cause a gradient explosion. Gradient clipping helps stabilize the training process. More introduction can be found on [this page](https://paperswithcode.com/method/gradient-clipping).
+
+Currently we support the `clip_grad` option in `optim_wrapper` for gradient clipping; refer to the [PyTorch documentation](torch.nn.utils.clip_grad_norm_).
+
+Here is an example:
+
+```python
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
+ # norm_type: type of the used p-norm, here norm_type is 2.
+ clip_grad=dict(max_norm=35, norm_type=2))
+```
+
+### Gradient accumulation
+
+When computing resources are limited, the batch size can only be set to a small value, which may affect the performance of models. Gradient accumulation can be used to work around this problem. We support the `accumulative_counts` option in `optim_wrapper` for gradient accumulation.
+
+Here is an example:
+
+```python
+train_dataloader = dict(batch_size=64)
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
+ accumulative_counts=4)
+```
+
+This indicates that, during training, the optimizer performs a parameter update every 4 iterations while gradients are accumulated in between. The above is roughly equivalent to:
+
+```python
+train_dataloader = dict(batch_size=256)
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001))
+```
+
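+Putting the options in this section together, the following sketch (with arbitrary example values) enables AMP, gradient clipping and gradient accumulation at the same time:
+
+```python
+optim_wrapper = dict(
+    type='AmpOptimWrapper',  # automatic mixed precision training
+    optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001),
+    # clip gradients by their 2-norm
+    clip_grad=dict(max_norm=35, norm_type=2),
+    # update parameters every 4 iterations
+    accumulative_counts=4)
+```
+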
+## Customize parameter schedules
+
+In training, optimization parameters such as the learning rate and momentum are usually not fixed but change over iterations or epochs. PyTorch supports several learning rate schedulers, but they are not sufficient for complex strategies. In MMPretrain, we provide `param_scheduler` for better control of different parameter schedules.
+
+### Customize learning rate schedules
+
+Learning rate schedulers are widely used to improve performance. We support most of the PyTorch schedulers, including `ExponentialLR`, `LinearLR`, `StepLR`, `MultiStepLR`, etc.
+
+All available learning rate schedulers can be found {external+mmengine:doc}`here `, and the
+names of learning rate schedulers all end with `LR`.
+
+- **Single learning rate schedule**
+
+ In most cases, we use only one learning rate schedule for simplicity. For instance, [`MultiStepLR`](mmengine.optim.MultiStepLR) is used as the default learning rate schedule for ResNet. Here, `param_scheduler` is a dictionary.
+
+ ```python
+ param_scheduler = dict(
+ type='MultiStepLR',
+ by_epoch=True,
+ milestones=[100, 150],
+ gamma=0.1)
+ ```
+
+  Or, if we want to use the [`CosineAnnealingLR`](mmengine.optim.CosineAnnealingLR) scheduler to decay the learning rate:
+
+ ```python
+ param_scheduler = dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ T_max=num_epochs)
+ ```
+
+- **Multiple learning rate schedules**
+
+  In some training cases, multiple learning rate schedules are applied for higher accuracy. For example, in the early stage, training is prone to volatility, and warmup is a technique to reduce it.
+  The learning rate increases gradually from a small value to the expected value during warmup and decays afterwards according to other schedules.
+
+  In MMPretrain, simply combining the desired schedules into a list in `param_scheduler` achieves the warmup strategy.
+
+ Here are some examples:
+
+ 1. linear warmup during the first 50 iters.
+
+ ```python
+ param_scheduler = [
+ # linear warm-up by iters
+ dict(type='LinearLR',
+ start_factor=0.001,
+ by_epoch=False, # by iters
+ end=50), # only warm up for first 50 iters
+      # main learning rate schedule
+ dict(type='MultiStepLR',
+ by_epoch=True,
+ milestones=[8, 11],
+ gamma=0.1)
+ ]
+ ```
+
+ 2. linear warmup and update lr by iter during the first 10 epochs.
+
+ ```python
+ param_scheduler = [
+ # linear warm-up by epochs in [0, 10) epochs
+ dict(type='LinearLR',
+ start_factor=0.001,
+ by_epoch=True,
+ end=10,
+ convert_to_iter_based=True, # Update learning rate by iter.
+ ),
+ # use CosineAnnealing schedule after 10 epochs
+ dict(type='CosineAnnealingLR', by_epoch=True, begin=10)
+ ]
+ ```
+
+  Notice that we use the `begin` and `end` arguments here to assign the valid range, which is [`begin`, `end`) for this schedule, and the range unit is defined by the `by_epoch` argument. If not specified, `begin` defaults to 0 and `end` defaults to the maximum number of epochs or iterations.
+
+  If the ranges of the schedules are not continuous, the learning rate stays constant in the ignored ranges; otherwise, all valid schedulers are executed in order within a specific stage, which behaves the same as the PyTorch [`ChainedScheduler`](torch.optim.lr_scheduler.ChainedScheduler).
+
+ ```{tip}
+  To check that the learning rate curve is as expected, after completing your configuration file, you can use the [optimizer parameter visualization tool](../useful_tools/scheduler_visualization.md) to draw the corresponding learning rate adjustment curve.
+ ```
+
+### Customize momentum schedules
+
+We support using momentum schedulers to modify the optimizer's momentum according to the learning rate, which could make the loss converge faster. The usage is the same as for learning rate schedulers.
+
+All available momentum schedulers can be found {external+mmengine:doc}`here `, and their
+names all end with `Momentum`.
+
+Here is an example:
+
+```python
+param_scheduler = [
+ # the lr scheduler
+ dict(type='LinearLR', ...),
+ # the momentum scheduler
+ dict(type='LinearMomentum',
+ start_factor=0.001,
+ by_epoch=False,
+ begin=0,
+ end=1000)
+]
+```
+
+## Add new optimizers or constructors
+
+```{note}
+This part will modify the MMPretrain source code or add code to the MMPretrain framework. Beginners can skip it.
+```
+
+### Add new optimizers
+
+In academic research and industrial practice, it may be necessary to use optimization methods not implemented by MMPretrain. You can add them through the following steps.
+
+1. Implement a New Optimizer
+
+   Assume you want to add an optimizer named `MyOptimizer`, which has arguments `a`, `b`, and `c`.
+   You need to create a new file under `mmpretrain/engine/optimizers`, and implement the new optimizer in it, for example, in `mmpretrain/engine/optimizers/my_optimizer.py` (a complete toy example is given at the end of this subsection):
+
+ ```python
+ from torch.optim import Optimizer
+ from mmpretrain.registry import OPTIMIZERS
+
+
+ @OPTIMIZERS.register_module()
+ class MyOptimizer(Optimizer):
+
+ def __init__(self, a, b, c):
+ ...
+
+ def step(self, closure=None):
+ ...
+ ```
+
+2. Import the Optimizer
+
+   To be found by the registry, the module defined above must be imported when the program runs.
+
+ Import it in the `mmpretrain/engine/optimizers/__init__.py` to add it into the `mmpretrain.engine` package.
+
+ ```python
+ # In mmpretrain/engine/optimizers/__init__.py
+ ...
+   from .my_optimizer import MyOptimizer  # MyOptimizer may be any other class name
+
+ __all__ = [..., 'MyOptimizer']
+ ```
+
+   At runtime, the `mmpretrain.engine` package is imported automatically and `MyOptimizer` is registered at the same time.
+
+3. Specify the Optimizer in Config
+
+ Then you can use `MyOptimizer` in the `optim_wrapper.optimizer` field of config files.
+
+ ```python
+ optim_wrapper = dict(
+ optimizer=dict(type='MyOptimizer', a=a_value, b=b_value, c=c_value))
+ ```
+
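+The skeleton in step 1 leaves the implementation bodies out. As a self-contained toy example (not part of MMPretrain), a complete optimizer performing a plain SGD-style update with a single `lr` hyper-parameter could look like this:
+
+```python
+import torch
+from torch.optim import Optimizer
+
+from mmpretrain.registry import OPTIMIZERS
+
+
+@OPTIMIZERS.register_module()
+class ToySGD(Optimizer):
+    """A toy optimizer that performs plain gradient descent."""
+
+    def __init__(self, params, lr=0.01):
+        defaults = dict(lr=lr)
+        super().__init__(params, defaults)
+
+    @torch.no_grad()
+    def step(self, closure=None):
+        loss = None
+        if closure is not None:
+            with torch.enable_grad():
+                loss = closure()
+        for group in self.param_groups:
+            for p in group['params']:
+                if p.grad is None:
+                    continue
+                # p <- p - lr * grad
+                p.add_(p.grad, alpha=-group['lr'])
+        return loss
+```
+
+After importing it as described in step 2, it could be referenced in configs with `optimizer=dict(type='ToySGD', lr=0.01)`.
+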
+### Add new optimizer constructors
+
+Some models may have parameter-specific optimization settings, like a different weight decay rate for all `BatchNorm` layers.
+
+Although we can already use [the `optim_wrapper.paramwise_cfg` field](#parameter-wise-fine-grained-configuration) to
+configure various parameter-specific optimizer settings, it may still not cover your needs.
+
+In that case, you can modify the behavior. By default, we use the [`DefaultOptimWrapperConstructor`](mmengine.optim.DefaultOptimWrapperConstructor)
+class to construct the optimizer. During the construction, it finely configures the optimizer settings of
+different parameters according to `paramwise_cfg`, and it can also serve as a template for new optimizer constructors.
+
+You can overwrite these behaviors by adding new optimizer constructors.
+
+```python
+# In mmpretrain/engine/optimizers/my_optim_constructor.py
+from mmengine.optim import DefaultOptimWrapperConstructor
+from mmpretrain.registry import OPTIM_WRAPPER_CONSTRUCTORS
+
+
+@OPTIM_WRAPPER_CONSTRUCTORS.register_module()
+class MyOptimWrapperConstructor:
+
+ def __init__(self, optim_wrapper_cfg, paramwise_cfg=None):
+ ...
+
+ def __call__(self, model):
+ ...
+```
+
+Here is a specific example of [OptimWrapperConstructor](mmpretrain.engine.optimizers.LearningRateDecayOptimWrapperConstructor).
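+
+As a rough, self-contained sketch (not part of MMPretrain), the toy constructor below puts every bias parameter into a group without weight decay; for simplicity it ignores `paramwise_cfg` and any extra wrapper options such as `clip_grad`:
+
+```python
+from mmengine.optim import OptimWrapper
+
+from mmpretrain.registry import OPTIM_WRAPPER_CONSTRUCTORS, OPTIMIZERS
+
+
+@OPTIM_WRAPPER_CONSTRUCTORS.register_module()
+class NoBiasDecayConstructor:
+
+    def __init__(self, optim_wrapper_cfg, paramwise_cfg=None):
+        self.optim_wrapper_cfg = optim_wrapper_cfg
+        self.paramwise_cfg = paramwise_cfg or {}
+
+    def __call__(self, model):
+        decay, no_decay = [], []
+        for name, param in model.named_parameters():
+            if not param.requires_grad:
+                continue
+            (no_decay if name.endswith('.bias') else decay).append(param)
+
+        optimizer_cfg = self.optim_wrapper_cfg['optimizer'].copy()
+        optimizer_cfg['params'] = [
+            dict(params=decay),
+            dict(params=no_decay, weight_decay=0.),
+        ]
+        optimizer = OPTIMIZERS.build(optimizer_cfg)
+        # A plain OptimWrapper is returned here; other fields of
+        # `optim_wrapper_cfg` are ignored in this sketch.
+        return OptimWrapper(optimizer=optimizer)
+```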
+
+Then, import it and use it almost in the same way as in [the optimizer tutorial](#add-new-optimizers).
+
+1. Import it in the `mmpretrain/engine/optimizers/__init__.py` to add it into the `mmpretrain.engine` package.
+
+ ```python
+ # In mmpretrain/engine/optimizers/__init__.py
+ ...
+ from .my_optim_constructor import MyOptimWrapperConstructor
+
+ __all__ = [..., 'MyOptimWrapperConstructor']
+ ```
+
+2. Use `MyOptimWrapperConstructor` in the `optim_wrapper.constructor` field of config files.
+
+ ```python
+ optim_wrapper = dict(
+ constructor=dict(type='MyOptimWrapperConstructor'),
+ optimizer=...,
+ paramwise_cfg=...,
+ )
+ ```
diff --git a/docs/en/api/apis.rst b/docs/en/api/apis.rst
new file mode 100644
index 0000000000000000000000000000000000000000..074960b6c313b63ff6bb2e98ef85a526a057ad15
--- /dev/null
+++ b/docs/en/api/apis.rst
@@ -0,0 +1,48 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.apis
+
+mmpretrain.apis
+===================================
+
+These are some high-level APIs for classification tasks.
+
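+As a quick, illustrative sketch (mirroring the get-started guide; the model name is just an example):
+
+.. code:: python
+
+    from mmpretrain.apis import get_model, inference_model, list_models
+
+    # list ImageNet-1k ResNet models known to MMPreTrain
+    print(list_models('resnet*_in1k'))
+
+    # build a model with pre-trained weights and run inference on one image
+    model = get_model('resnet18_8xb32_in1k', pretrained=True)
+    result = inference_model(model, 'demo/demo.JPEG')
+    print(result['pred_class'])
+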
+.. contents:: mmpretrain.apis
+ :depth: 2
+ :local:
+ :backlinks: top
+
+Model
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ list_models
+ get_model
+
+Inference
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: callable.rst
+
+ ImageClassificationInferencer
+ ImageRetrievalInferencer
+ ImageCaptionInferencer
+ VisualQuestionAnsweringInferencer
+ VisualGroundingInferencer
+ TextToImageRetrievalInferencer
+ ImageToTextRetrievalInferencer
+ NLVRInferencer
+ FeatureExtractor
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ inference_model
diff --git a/docs/en/api/data_process.rst b/docs/en/api/data_process.rst
new file mode 100644
index 0000000000000000000000000000000000000000..af0f6e54ec2b76d61fd504abb79806d610329444
--- /dev/null
+++ b/docs/en/api/data_process.rst
@@ -0,0 +1,329 @@
+.. role:: hidden
+ :class: hidden-section
+
+Data Process
+=================
+
+In MMPreTrain, the data process and the dataset are decoupled. The
+datasets only define how to get samples' basic information from the file
+system. This basic information includes the ground-truth label and the raw
+image data or the paths of images. The data process includes data transforms,
+data preprocessors and batch augmentations.
+
+- :mod:`Data Transforms `: Transforms include loading, preprocessing, formatting, etc.
+- :mod:`Data Preprocessors `: Processes include collating, normalization, stacking, channel flipping, etc.
+
+ - :mod:`Batch Augmentations `: Batch augmentation involves multiple samples, such as Mixup and CutMix.
+
+.. module:: mmpretrain.datasets.transforms
+
+Data Transforms
+--------------------
+
+To prepare the input data, we need to do some transforms on this basic
+information. These transforms include loading, preprocessing and
+formatting, and a series of data transforms makes up a data pipeline.
+Therefore, you can find a ``pipeline`` argument in the configs of the dataset,
+for example:
+
+.. code:: python
+
+ train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+ ]
+
+ train_dataloader = dict(
+ ....
+ dataset=dict(
+ pipeline=train_pipeline,
+ ....),
+ ....
+ )
+
+Every item of a pipeline list is one of the following data transform classes. And if you want to add a custom data transformation class, the tutorial :doc:`Custom Data Pipelines ` will help you.
+
+.. contents::
+ :depth: 1
+ :local:
+ :backlinks: top
+
+Loading and Formatting
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: data_transform.rst
+
+ LoadImageFromFile
+ PackInputs
+ PackMultiTaskInputs
+ PILToNumpy
+ NumpyToPIL
+ Transpose
+ Collect
+
+Processing and Augmentation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: data_transform.rst
+
+ Albumentations
+ CenterCrop
+ ColorJitter
+ EfficientNetCenterCrop
+ EfficientNetRandomCrop
+ Lighting
+ Normalize
+ RandomCrop
+ RandomErasing
+ RandomFlip
+ RandomGrayscale
+ RandomResize
+ RandomResizedCrop
+ Resize
+ ResizeEdge
+ BEiTMaskGenerator
+ SimMIMMaskGenerator
+
+Composed Augmentation
+"""""""""""""""""""""
+Composed augmentations are methods which compose a series of data
+augmentation transforms, such as ``AutoAugment`` and ``RandAugment``.
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: data_transform.rst
+
+ AutoAugment
+ RandAugment
+
+The above transforms are composed from a group of policies based on the random
+transforms below:
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: data_transform.rst
+
+ AutoContrast
+ Brightness
+ ColorTransform
+ Contrast
+ Cutout
+ Equalize
+ GaussianBlur
+ Invert
+ Posterize
+ Rotate
+ Sharpness
+ Shear
+ Solarize
+ SolarizeAdd
+ Translate
+ BaseAugTransform
+
+MMCV transforms
+^^^^^^^^^^^^^^^
+
+Many transforms in MMCV can also be used directly in the config files. The whole transform list can be found in :external+mmcv:doc:`api/transforms`.
+
+Transform Wrapper
+^^^^^^^^^^^^^^^^^
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: data_transform.rst
+
+ MultiView
+
+.. module:: mmpretrain.models.utils.data_preprocessor
+
+
+TorchVision Transforms
+^^^^^^^^^^^^^^^^^^^^^^
+
+We also provide all the transforms in TorchVision. You can use them like the following examples:
+
+**1. Use some TorchVision Augs Surrounded by NumpyToPIL and PILToNumpy (Recommended)**
+
+Add TorchVision Augs surrounded by ``dict(type='NumpyToPIL', to_rgb=True),`` and ``dict(type='PILToNumpy', to_bgr=True),``
+
+.. code:: python
+
+ train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='NumpyToPIL', to_rgb=True), # from BGR in cv2 to RGB in PIL
+ dict(type='torchvision/RandomResizedCrop',size=176),
+ dict(type='PILToNumpy', to_bgr=True), # from RGB in PIL to BGR in cv2
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+ ]
+
+ data_preprocessor = dict(
+ num_classes=1000,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True, # from BGR in cv2 to RGB in PIL
+ )
+
+
+**2. Use TorchVision Augs and ToTensor&Normalize**
+
+Make sure the 'img' has been converted to PIL format from BGR-Numpy format before being processed by TorchVision Augs.
+
+.. code:: python
+
+ train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='NumpyToPIL', to_rgb=True), # from BGR in cv2 to RGB in PIL
+ dict(
+ type='torchvision/RandomResizedCrop',
+ size=176,
+ interpolation='bilinear'), # accept str format interpolation mode
+ dict(type='torchvision/RandomHorizontalFlip', p=0.5),
+ dict(
+ type='torchvision/TrivialAugmentWide',
+ interpolation='bilinear'),
+ dict(type='torchvision/PILToTensor'),
+ dict(type='torchvision/ConvertImageDtype', dtype=torch.float),
+ dict(
+ type='torchvision/Normalize',
+ mean=(0.485, 0.456, 0.406),
+ std=(0.229, 0.224, 0.225),
+ ),
+ dict(type='torchvision/RandomErasing', p=0.1),
+ dict(type='PackInputs'),
+ ]
+
+ data_preprocessor = dict(num_classes=1000, mean=None, std=None, to_rgb=False) # Normalize in dataset pipeline
+
+
+**3. Use TorchVision Augs Except ToTensor&Normalize**
+
+.. code:: python
+
+ train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='NumpyToPIL', to_rgb=True), # from BGR in cv2 to RGB in PIL
+ dict(type='torchvision/RandomResizedCrop', size=176, interpolation='bilinear'),
+ dict(type='torchvision/RandomHorizontalFlip', p=0.5),
+ dict(type='torchvision/TrivialAugmentWide', interpolation='bilinear'),
+ dict(type='PackInputs'),
+ ]
+
+ # here the Normalize params is for the RGB format
+ data_preprocessor = dict(
+ num_classes=1000,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=False,
+ )
+
+
+Data Preprocessors
+------------------
+
+The data preprocessor is also a component that processes the data before feeding it to the neural network.
+Compared with the data transforms, the data preprocessor is a module of the classifier,
+and it processes a batch of data at a time, which means it can use the GPU and batching to accelerate the processing.
+
+The default data preprocessor in MMPreTrain can do the following pre-processing:
+
+1. Move data to the target device.
+2. Pad inputs to the maximum size of current batch.
+3. Stack inputs to a batch.
+4. Convert inputs from bgr to rgb if the shape of input is (3, H, W).
+5. Normalize image with defined std and mean.
+6. Do batch augmentations like Mixup and CutMix during training.
+
+You can configure the data preprocessor by the ``data_preprocessor`` field or ``model.data_preprocessor`` field in the config file. Typical usages are as below:
+
+.. code-block:: python
+
+ data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True, # convert image from BGR to RGB
+ )
+
+Or define in ``model.data_preprocessor`` as following:
+
+.. code-block:: python
+
+ model = dict(
+ backbone = ...,
+ neck = ...,
+ head = ...,
+ data_preprocessor = dict(
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+            to_rgb=True),
+ train_cfg=...,
+ )
+
+Note that the ``model.data_preprocessor`` has higher priority than ``data_preprocessor``.
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ ClsDataPreprocessor
+ SelfSupDataPreprocessor
+ TwoNormDataPreprocessor
+ VideoDataPreprocessor
+
+.. module:: mmpretrain.models.utils.batch_augments
+
+Batch Augmentations
+^^^^^^^^^^^^^^^^^^^^
+
+Batch augmentation is a component of the data preprocessors. It involves multiple samples and mixes them in some way, such as Mixup and CutMix.
+
+These augmentations are usually only used during training, therefore, we use the ``model.train_cfg`` field to configure them in config files.
+
+.. code-block:: python
+
+ model = dict(
+ backbone=...,
+ neck=...,
+ head=...,
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+ )
+
+You can also specify the probabilities of every batch augmentation by the ``probs`` field.
+
+.. code-block:: python
+
+ model = dict(
+ backbone=...,
+ neck=...,
+ head=...,
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ], probs=[0.3, 0.7])
+ )
+
+Here is a list of batch augmentations that can be used in MMPreTrain.
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: callable.rst
+
+ Mixup
+ CutMix
+ ResizeMix
diff --git a/docs/en/api/datasets.rst b/docs/en/api/datasets.rst
new file mode 100644
index 0000000000000000000000000000000000000000..069880dd722457225c864639600aa5e0ff54f6ff
--- /dev/null
+++ b/docs/en/api/datasets.rst
@@ -0,0 +1,129 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.datasets
+
+mmpretrain.datasets
+===================================
+
+The ``datasets`` package contains several commonly used datasets for image classification tasks and some dataset wrappers.
+
+.. contents:: mmpretrain.datasets
+ :depth: 2
+ :local:
+ :backlinks: top
+
+Custom Dataset
+--------------
+
+.. autoclass:: CustomDataset
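+
+An incomplete, illustrative config snippet (the ``data_root`` path and pipeline are placeholders):
+
+.. code:: python
+
+    train_dataloader = dict(
+        batch_size=32,
+        dataset=dict(
+            type='CustomDataset',
+            data_root='data/my_dataset/train',  # hypothetical path
+            pipeline=[...],                     # your data pipeline
+        ),
+    )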
+
+ImageNet
+--------
+
+.. autoclass:: ImageNet
+
+.. autoclass:: ImageNet21k
+
+CIFAR
+-----
+
+.. autoclass:: CIFAR10
+
+.. autoclass:: CIFAR100
+
+MNIST
+-----
+
+.. autoclass:: MNIST
+
+.. autoclass:: FashionMNIST
+
+VOC
+---
+
+.. autoclass:: VOC
+
+CUB
+---
+
+.. autoclass:: CUB
+
+Places205
+---------
+
+.. autoclass:: Places205
+
+Retrieval
+---------
+
+.. autoclass:: InShop
+
+Base classes
+------------
+
+.. autoclass:: BaseDataset
+
+.. autoclass:: MultiLabelDataset
+
+Caltech101
+----------------
+
+.. autoclass:: Caltech101
+
+Food101
+----------------
+
+.. autoclass:: Food101
+
+DTD
+----------------
+
+.. autoclass:: DTD
+
+FGVCAircraft
+----------------
+
+.. autoclass:: FGVCAircraft
+
+
+Flowers102
+----------------
+
+.. autoclass:: Flowers102
+
+StanfordCars
+----------------
+
+.. autoclass:: StanfordCars
+
+OxfordIIITPet
+----------------
+
+.. autoclass:: OxfordIIITPet
+
+SUN397
+----------------
+
+.. autoclass:: SUN397
+
+RefCOCO
+--------
+
+.. autoclass:: RefCOCO
+
+Dataset Wrappers
+----------------
+
+.. autoclass:: KFoldDataset
+
+The dataset wrappers in MMEngine can be directly used in MMPreTrain.
+
+.. list-table::
+
+ * - :class:`~mmengine.dataset.ConcatDataset`
+ - A wrapper of concatenated dataset.
+ * - :class:`~mmengine.dataset.RepeatDataset`
+ - A wrapper of repeated dataset.
+ * - :class:`~mmengine.dataset.ClassBalancedDataset`
+ - A wrapper of class balanced dataset.
diff --git a/docs/en/api/engine.rst b/docs/en/api/engine.rst
new file mode 100644
index 0000000000000000000000000000000000000000..2e67fd064058dae19a188efd4e2f513b13ba63c6
--- /dev/null
+++ b/docs/en/api/engine.rst
@@ -0,0 +1,51 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.engine
+
+mmpretrain.engine
+===================================
+
+This package includes some runtime components, such as hooks, runners, optimizers and loops. These components are useful in
+classification tasks but are not yet supported by MMEngine.
+
+.. note::
+
+ Some components may be moved to MMEngine in the future.
+
+.. contents:: mmpretrain.engine
+ :depth: 2
+ :local:
+ :backlinks: top
+
+.. module:: mmpretrain.engine.hooks
+
+Hooks
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ ClassNumCheckHook
+ PreciseBNHook
+ VisualizationHook
+ PrepareProtoBeforeValLoopHook
+ SetAdaptiveMarginsHook
+ EMAHook
+ SimSiamHook
+ DenseCLHook
+ SwAVHook
+
+.. module:: mmpretrain.engine.optimizers
+
+Optimizers
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ Lamb
+ LARS
+ LearningRateDecayOptimWrapperConstructor
diff --git a/docs/en/api/evaluation.rst b/docs/en/api/evaluation.rst
new file mode 100644
index 0000000000000000000000000000000000000000..bddea207879dec23ce72efe68b682561836dcd92
--- /dev/null
+++ b/docs/en/api/evaluation.rst
@@ -0,0 +1,47 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.evaluation
+
+mmpretrain.evaluation
+===================================
+
+This package includes metrics and evaluators for classification tasks.
+
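+For example, a typical classification config uses the :class:`Accuracy` metric as the evaluator. An illustrative snippet (not a complete config):
+
+.. code:: python
+
+    # report top-1 and top-5 accuracy on the validation and test sets
+    val_evaluator = dict(type='Accuracy', topk=(1, 5))
+    test_evaluator = val_evaluator
+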
+.. contents:: mmpretrain.evaluation
+ :depth: 1
+ :local:
+ :backlinks: top
+
+Single Label Metric
+----------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ Accuracy
+ SingleLabelMetric
+ ConfusionMatrix
+
+Multi Label Metric
+----------------------
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ AveragePrecision
+ MultiLabelMetric
+ VOCAveragePrecision
+ VOCMultiLabelMetric
+
+Retrieval Metric
+----------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: classtemplate.rst
+
+ RetrievalRecall
+ RetrievalAveragePrecision
diff --git a/docs/en/api/models.rst b/docs/en/api/models.rst
new file mode 100644
index 0000000000000000000000000000000000000000..30980324a4fa0302806cfbb5c5dee903782b9757
--- /dev/null
+++ b/docs/en/api/models.rst
@@ -0,0 +1,364 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.models
+
+mmpretrain.models
+===================================
+
+The ``models`` package contains several sub-packages for addressing the different components of a model.
+
+- :mod:`~mmpretrain.models.classifiers`: The top-level module which defines the whole process of a classification model.
+- :mod:`~mmpretrain.models.selfsup`: The top-level module which defines the whole process of a self-supervised learning model.
+- :mod:`~mmpretrain.models.retrievers`: The top-level module which defines the whole process of a retrieval model.
+- :mod:`~mmpretrain.models.backbones`: Usually a feature extraction network, e.g., ResNet, MobileNet.
+- :mod:`~mmpretrain.models.necks`: The component between backbones and heads, e.g., GlobalAveragePooling.
+- :mod:`~mmpretrain.models.heads`: The component for specific tasks.
+- :mod:`~mmpretrain.models.losses`: Loss functions.
+- :mod:`~mmpretrain.models.peft`: The PEFT (Parameter-Efficient Fine-Tuning) module, e.g. LoRAModel.
+- :mod:`~mmpretrain.models.utils`: Some helper functions and common components used in various networks.
+
+ - :mod:`~mmpretrain.models.utils.data_preprocessor`: The component before model to preprocess the inputs, e.g., ClsDataPreprocessor.
+ - :ref:`components`: Common components used in various networks.
+ - :ref:`helpers`: Helper functions.
+
+Build Functions
+---------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ build_classifier
+ build_backbone
+ build_neck
+ build_head
+ build_loss
+
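+As an illustrative sketch, these build functions accept the same config dicts that are used in config files:
+
+.. code:: python
+
+    from mmpretrain.models import build_backbone
+
+    # build a ResNet-50 backbone from a config dict (illustrative settings)
+    backbone = build_backbone(dict(type='ResNet', depth=50))
+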
+.. module:: mmpretrain.models.classifiers
+
+Classifiers
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ BaseClassifier
+ ImageClassifier
+ TimmClassifier
+ HuggingFaceClassifier
+
+.. module:: mmpretrain.models.selfsup
+
+Self-supervised Algorithms
+--------------------------
+
+.. _selfsup_algorithms:
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ BaseSelfSupervisor
+ BEiT
+ BYOL
+ BarlowTwins
+ CAE
+ DenseCL
+ EVA
+ iTPN
+ MAE
+ MILAN
+ MaskFeat
+ MixMIM
+ MoCo
+ MoCoV3
+ SimCLR
+ SimMIM
+ SimSiam
+ SparK
+ SwAV
+
+.. _selfsup_backbones:
+
+Some of the above algorithms modify the backbone module to adapt to extra inputs
+like ``mask``, and here is a list of these **modified backbone** modules.
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ BEiTPretrainViT
+ CAEPretrainViT
+ iTPNHiViT
+ MAEHiViT
+ MAEViT
+ MILANViT
+ MaskFeatViT
+ MixMIMPretrainTransformer
+ MoCoV3ViT
+ SimMIMSwinTransformer
+
+.. _target_generators:
+
+Some self-supervised algorithms need an external **target generator** to
+generate the optimization target. Here is a list of target generators.
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ VQKD
+ DALLEEncoder
+ HOGGenerator
+ CLIPGenerator
+
+.. module:: mmpretrain.models.retrievers
+
+Retrievers
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ BaseRetriever
+ ImageToImageRetriever
+
+.. module:: mmpretrain.models.multimodal
+
+Multi-Modality Algorithms
+--------------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ Blip2Caption
+ Blip2Retrieval
+ Blip2VQA
+ BlipCaption
+ BlipGrounding
+ BlipNLVR
+ BlipRetrieval
+ BlipVQA
+ Flamingo
+ OFA
+ MiniGPT4
+ Llava
+ Otter
+
+.. module:: mmpretrain.models.backbones
+
+Backbones
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ AlexNet
+ BEiTViT
+ CSPDarkNet
+ CSPNet
+ CSPResNeXt
+ CSPResNet
+ Conformer
+ ConvMixer
+ ConvNeXt
+ DaViT
+ DeiT3
+ DenseNet
+ DistilledVisionTransformer
+ EdgeNeXt
+ EfficientFormer
+ EfficientNet
+ EfficientNetV2
+ HiViT
+ HRNet
+ HorNet
+ InceptionV3
+ LeNet5
+ LeViT
+ MViT
+ MlpMixer
+ MobileNetV2
+ MobileNetV3
+ MobileOne
+ MobileViT
+ PCPVT
+ PoolFormer
+ PyramidVig
+ RegNet
+ RepLKNet
+ RepMLPNet
+ RepVGG
+ Res2Net
+ ResNeSt
+ ResNeXt
+ ResNet
+ ResNetV1c
+ ResNetV1d
+ ResNet_CIFAR
+ RevVisionTransformer
+ SEResNeXt
+ SEResNet
+ SVT
+ ShuffleNetV1
+ ShuffleNetV2
+ SparseResNet
+ SparseConvNeXt
+ SwinTransformer
+ SwinTransformerV2
+ T2T_ViT
+ TIMMBackbone
+ TNT
+ VAN
+ VGG
+ Vig
+ VisionTransformer
+ ViTSAM
+ XCiT
+ ViTEVA02
+
+.. module:: mmpretrain.models.necks
+
+Necks
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ BEiTV2Neck
+ CAENeck
+ ClsBatchNormNeck
+ DenseCLNeck
+ GeneralizedMeanPooling
+ GlobalAveragePooling
+ HRFuseScales
+ LinearNeck
+ MAEPretrainDecoder
+ MILANPretrainDecoder
+ MixMIMPretrainDecoder
+ MoCoV2Neck
+ NonLinearNeck
+ SimMIMLinearDecoder
+ SwAVNeck
+ iTPNPretrainDecoder
+ SparKLightDecoder
+
+.. module:: mmpretrain.models.heads
+
+Heads
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ ArcFaceClsHead
+ BEiTV1Head
+ BEiTV2Head
+ CAEHead
+ CSRAClsHead
+ ClsHead
+ ConformerHead
+ ContrastiveHead
+ DeiTClsHead
+ EfficientFormerClsHead
+ LatentCrossCorrelationHead
+ LatentPredictHead
+ LeViTClsHead
+ LinearClsHead
+ MAEPretrainHead
+ MIMHead
+ MixMIMPretrainHead
+ MoCoV3Head
+ MultiLabelClsHead
+ MultiLabelLinearClsHead
+ MultiTaskHead
+ SimMIMHead
+ StackedLinearClsHead
+ SwAVHead
+ VigClsHead
+ VisionTransformerClsHead
+ iTPNClipHead
+ SparKPretrainHead
+
+.. module:: mmpretrain.models.losses
+
+Losses
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ AsymmetricLoss
+ CAELoss
+ CosineSimilarityLoss
+ CrossCorrelationLoss
+ CrossEntropyLoss
+ FocalLoss
+ LabelSmoothLoss
+ PixelReconstructionLoss
+ SeesawLoss
+ SwAVLoss
+
+.. module:: mmpretrain.models.peft
+
+PEFT
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ LoRAModel
+
+.. module:: mmpretrain.models.utils
+
+models.utils
+------------
+
+This package includes some helper functions and common components used in various networks.
+
+.. _components:
+
+Common Components
+^^^^^^^^^^^^^^^^^
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ ConditionalPositionEncoding
+ CosineEMA
+ HybridEmbed
+ InvertedResidual
+ LayerScale
+ MultiheadAttention
+ PatchEmbed
+ PatchMerging
+ SELayer
+ ShiftWindowMSA
+ WindowMSA
+ WindowMSAV2
+
+.. _helpers:
+
+Helper Functions
+^^^^^^^^^^^^^^^^
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ channel_shuffle
+ is_tracing
+ make_divisible
+ resize_pos_embed
+ resize_relative_position_bias_table
+ to_ntuple
diff --git a/docs/en/api/structures.rst b/docs/en/api/structures.rst
new file mode 100644
index 0000000000000000000000000000000000000000..10caa37c8e96dde2f2fa57714d68f16ec2893967
--- /dev/null
+++ b/docs/en/api/structures.rst
@@ -0,0 +1,13 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.structures
+
+mmpretrain.structures
+===================================
+
+This package includes basic data structures.
+
+DataSample
+-------------
+.. autoclass:: DataSample
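+
+A minimal usage sketch (the label and scores below are arbitrary values):
+
+.. code:: python
+
+    import torch
+
+    from mmpretrain.structures import DataSample
+
+    data_sample = DataSample()
+    data_sample.set_gt_label(3)                 # ground-truth label
+    data_sample.set_pred_score(torch.rand(10))  # predicted scores of 10 classes
+    print(data_sample)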
diff --git a/docs/en/api/utils.rst b/docs/en/api/utils.rst
new file mode 100644
index 0000000000000000000000000000000000000000..b2b9ea91c5589b33206c2ce614e92c16a02a2179
--- /dev/null
+++ b/docs/en/api/utils.rst
@@ -0,0 +1,19 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.utils
+
+mmpretrain.utils
+===================================
+
+This package includes some useful helper functions for development.
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ collect_env
+ register_all_modules
+ load_json_log
+ track_on_main_process
+ get_ori_model
diff --git a/docs/en/api/visualization.rst b/docs/en/api/visualization.rst
new file mode 100644
index 0000000000000000000000000000000000000000..85742a1c487f9ceff424f35fd8e1b0e2898997a1
--- /dev/null
+++ b/docs/en/api/visualization.rst
@@ -0,0 +1,14 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.visualization
+
+mmpretrain.visualization
+===================================
+
+This package includes visualizer and some helper functions for visualization.
+
+Visualizer
+-------------
+.. autoclass:: UniversalVisualizer
+ :members:
diff --git a/docs/en/conf.py b/docs/en/conf.py
new file mode 100644
index 0000000000000000000000000000000000000000..a5a7fefbb9fd95f46075d926a6dc525ae50a28e5
--- /dev/null
+++ b/docs/en/conf.py
@@ -0,0 +1,248 @@
+# flake8: noqa
+# Configuration file for the Sphinx documentation builder.
+#
+# This file only contains a selection of the most common options. For a full
+# list see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os
+import subprocess
+import sys
+
+import pytorch_sphinx_theme
+from sphinx.builders.html import StandaloneHTMLBuilder
+
+sys.path.insert(0, os.path.abspath('../../'))
+
+# -- Project information -----------------------------------------------------
+
+project = 'MMPretrain'
+copyright = '2020, OpenMMLab'
+author = 'MMPretrain Authors'
+
+# The full version, including alpha/beta/rc tags
+version_file = '../../mmpretrain/version.py'
+
+
+def get_version():
+ with open(version_file, 'r') as f:
+ exec(compile(f.read(), version_file, 'exec'))
+ return locals()['__version__']
+
+
+release = get_version()
+
+# -- General configuration ---------------------------------------------------
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+ 'sphinx.ext.autodoc',
+ 'sphinx.ext.autosummary',
+ 'sphinx.ext.intersphinx',
+ 'sphinx.ext.napoleon',
+ 'sphinx.ext.viewcode',
+ 'myst_parser',
+ 'sphinx_copybutton',
+ 'sphinx_tabs.tabs',
+ 'notfound.extension',
+ 'sphinxcontrib.jquery',
+]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+#
+source_suffix = {
+ '.rst': 'restructuredtext',
+ '.md': 'markdown',
+}
+
+language = 'en'
+
+# The master toctree document.
+root_doc = 'index'
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages. See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'pytorch_sphinx_theme'
+html_theme_path = [pytorch_sphinx_theme.get_html_theme_path()]
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further. For a list of options available for each theme, see the
+# documentation.
+# yapf: disable
+html_theme_options = {
+ 'menu': [
+ {
+ 'name': 'GitHub',
+ 'url': 'https://github.com/open-mmlab/mmpretrain'
+ },
+ {
+ 'name': 'Colab Tutorials',
+ 'children': [
+ {'name': 'Train and inference with shell commands',
+ 'url': 'https://colab.research.google.com/github/mzr1996/mmpretrain-tutorial/blob/master/1.x/MMPretrain_tools.ipynb'},
+ {'name': 'Train and inference with Python APIs',
+ 'url': 'https://colab.research.google.com/github/mzr1996/mmpretrain-tutorial/blob/master/1.x/MMPretrain_python.ipynb'},
+ ]
+ },
+ {
+ 'name': 'Version',
+ 'children': [
+ {'name': 'MMPreTrain 0.x',
+ 'url': 'https://mmpretrain.readthedocs.io/en/0.x/',
+ 'description': '0.x branch'},
+ {'name': 'MMPreTrain 1.x',
+ 'url': 'https://mmpretrain.readthedocs.io/en/latest/',
+ 'description': 'Main branch'},
+ ],
+ }
+ ],
+ # Specify the language of shared menu
+ 'menu_lang': 'en',
+ # Disable the default edit on GitHub
+ 'default_edit_on_github': False,
+}
+# yapf: enable
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+html_css_files = [
+ 'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.css',
+ 'css/readthedocs.css'
+]
+html_js_files = [
+ 'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.js',
+ 'js/custom.js'
+]
+
+# -- Options for HTMLHelp output ---------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'mmpretraindoc'
+
+# -- Options for LaTeX output ------------------------------------------------
+
+latex_elements = {
+ # The paper size ('letterpaper' or 'a4paper').
+ #
+ # 'papersize': 'letterpaper',
+
+ # The font size ('10pt', '11pt' or '12pt').
+ #
+ # 'pointsize': '10pt',
+
+ # Additional stuff for the LaTeX preamble.
+ #
+ # 'preamble': '',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+# author, documentclass [howto, manual, or own class]).
+latex_documents = [
+ (root_doc, 'mmpretrain.tex', 'MMPretrain Documentation', author, 'manual'),
+]
+
+# -- Options for manual page output ------------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [(root_doc, 'mmpretrain', 'MMPretrain Documentation', [author], 1)]
+
+# -- Options for Texinfo output ----------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+# dir menu entry, description, category)
+texinfo_documents = [
+ (root_doc, 'mmpretrain', 'MMPretrain Documentation', author, 'mmpretrain',
+ 'OpenMMLab pre-training toolbox and benchmark.', 'Miscellaneous'),
+]
+
+# -- Options for Epub output -------------------------------------------------
+
+# Bibliographic Dublin Core info.
+epub_title = project
+
+# The unique identifier of the text. This can be a ISBN number
+# or the project homepage.
+#
+# epub_identifier = ''
+
+# A unique identification for the text.
+#
+# epub_uid = ''
+
+# A list of files that should not be packed into the epub file.
+epub_exclude_files = ['search.html']
+
+# set priority when building html
+StandaloneHTMLBuilder.supported_image_types = [
+ 'image/svg+xml', 'image/gif', 'image/png', 'image/jpeg'
+]
+
+# -- Extension configuration -------------------------------------------------
+# Ignore >>> when copying code
+copybutton_prompt_text = r'>>> |\.\.\. '
+copybutton_prompt_is_regexp = True
+
+# Auto-generated header anchors
+myst_heading_anchors = 3
+# Enable "colon_fence" extension of myst.
+myst_enable_extensions = ['colon_fence', 'dollarmath']
+
+# Configuration for intersphinx
+intersphinx_mapping = {
+ 'python': ('https://docs.python.org/3', None),
+ 'numpy': ('https://numpy.org/doc/stable', None),
+ 'torch': ('https://pytorch.org/docs/stable/', None),
+ 'mmcv': ('https://mmcv.readthedocs.io/en/2.x/', None),
+ 'mmengine': ('https://mmengine.readthedocs.io/en/latest/', None),
+ 'transformers':
+ ('https://huggingface.co/docs/transformers/main/en/', None),
+}
+napoleon_custom_sections = [
+ # Custom sections for data elements.
+ ('Meta fields', 'params_style'),
+ ('Data fields', 'params_style'),
+]
+
+# Disable docstring inheritance
+autodoc_inherit_docstrings = False
+# Mock some imports during generate API docs.
+autodoc_mock_imports = ['rich', 'attr', 'einops', 'mat4py']
+# Disable displaying type annotations, these can be very verbose
+autodoc_typehints = 'none'
+
+# The not found page
+notfound_template = '404.html'
+
+
+def builder_inited_handler(app):
+ if subprocess.run(['./stat.py']).returncode != 0:
+ raise RuntimeError('Failed to run the script `stat.py`.')
+
+
+def setup(app):
+ app.connect('builder-inited', builder_inited_handler)
diff --git a/docs/en/device/npu.md b/docs/en/device/npu.md
new file mode 100644
index 0000000000000000000000000000000000000000..d450029f7211bf10e00568bf00d26567f15b59a0
--- /dev/null
+++ b/docs/en/device/npu.md
@@ -0,0 +1,47 @@
+# NPU (HUAWEI Ascend)
+
+## Usage
+
+### General Usage
+
+Please refer to the [building documentation of MMCV](https://mmcv.readthedocs.io/en/latest/get_started/build.html#build-mmcv-full-on-ascend-npu-machine) to install MMCV and [MMEngine](https://mmengine.readthedocs.io/en/latest/get_started/installation.html#build-from-source) on NPU devices.
+
+Here we use 8 NPUs to train the model with the following command:
+
+```shell
+bash ./tools/dist_train.sh configs/resnet/resnet50_8xb32_in1k.py 8
+```
+
+Also, you can use only one NPU to train the model with the following command:
+
+```shell
+python ./tools/train.py configs/resnet/resnet50_8xb32_in1k.py
+```
+
+## Model Results
+
+| Model | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------------------------: | :-------: | :-------: | :----------------------------------------------------------: | :-------------------------------------------------------------: |
+| [ResNet-50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/README.md) | 76.40 | 93.21 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/resnet50_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/resnet50_8xb32_in1k.log) |
+| [ResNetXt-32x4d-50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnext/README.md) | 77.48 | 93.75 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnext/resnext50-32x4d_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/resnext50-32x4d_8xb32_in1k.log) |
+| [HRNet-W18](https://github.com/open-mmlab/mmclassification/blob/master/configs/hrnet/README.md) | 77.06 | 93.57 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/hrnet/hrnet-w18_4xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/hrnet-w18_4xb32_in1k.log) |
+| [ResNetV1D-152](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/README.md) | 79.41 | 94.48 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/resnetv1d152_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/resnetv1d152_8xb32_in1k.log) |
+| [SE-ResNet-50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/seresnet/README.md) | 77.65 | 93.74 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/seresnet/seresnet50_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/seresnet50_8xb32_in1k.log) |
+| [ShuffleNetV2 1.0x](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/shufflenet_v2/README.md) | 69.52 | 88.79 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/shufflenet-v2-1x_16xb64_in1k.log) |
+| [MobileNetV2](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/mobilenet_v2) | 71.74 | 90.28 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/mobilenet-v2_8xb32_in1k.log) |
+| [MobileNetV3-Small](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/mobilenet_v3/README.md) | 67.09 | 87.17 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/mobilenet-v3-small.log) |
+| [\*CSPResNeXt50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/cspnet/README.md) | 77.25 | 93.46 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/cspnet/cspresnext50_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/cspresnext50_8xb32_in1k.log) |
+| [\*EfficientNet-B4](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/efficientnet/README.md) | 75.73 | 92.91 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/efficientnet/efficientnet-b4_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/efficientnet-b4_8xb32_in1k.log) |
+| [\*\*DenseNet121](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/densenet/README.md) | 72.53 | 90.85 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/densenet/densenet121_4xb256_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/densenet121_4xb256_in1k.log) |
+
+**Notes:**
+
+- If not specially marked, the results on the NPU are almost the same as the results on the GPU with FP32.
+- (\*) The training results of these models are lower than the results in the corresponding model README, mainly
+  because the README results were evaluated directly with the pre-trained timm weights, while the results here were
+  retrained with mmcls according to the config. The GPU training results of the same config are consistent with the
+  NPU results.
+- (\*\*) The accuracy of this model is slightly lower because the config is a 4-card config while we ran it with 8 cards; users
+  can adjust the hyperparameters to get better accuracy.
+
+**All above models are provided by Huawei Ascend group.**
diff --git a/docs/en/docutils.conf b/docs/en/docutils.conf
new file mode 100644
index 0000000000000000000000000000000000000000..0c00c84688701117f231fd0c8ec295fb747b7d8f
--- /dev/null
+++ b/docs/en/docutils.conf
@@ -0,0 +1,2 @@
+[html writers]
+table_style: colwidths-auto
diff --git a/docs/en/get_started.md b/docs/en/get_started.md
new file mode 100644
index 0000000000000000000000000000000000000000..5d33ac00969a0701fbd067b9ad2321303c04a49d
--- /dev/null
+++ b/docs/en/get_started.md
@@ -0,0 +1,164 @@
+# Prerequisites
+
+In this section we demonstrate how to prepare an environment with PyTorch.
+
+MMPretrain works on Linux, Windows and macOS. It requires Python 3.7+, CUDA 10.2+ and PyTorch 1.8+.
+
+```{note}
+If you are experienced with PyTorch and have already installed it, just skip this part and jump to the [next section](#installation). Otherwise, you can follow these steps for the preparation.
+```
+
+**Step 1.** Download and install Miniconda from the [official website](https://docs.conda.io/en/latest/miniconda.html).
+
+**Step 2.** Create a conda environment and activate it.
+
+```shell
+conda create --name openmmlab python=3.8 -y
+conda activate openmmlab
+```
+
+**Step 3.** Install PyTorch following [official instructions](https://pytorch.org/get-started/locally/), e.g.
+
+On GPU platforms:
+
+```shell
+conda install pytorch torchvision -c pytorch
+```
+
+```{warning}
+This command will automatically install the latest version PyTorch and cudatoolkit, please check whether they match your environment.
+```
+
+On CPU platforms:
+
+```shell
+conda install pytorch torchvision cpuonly -c pytorch
+```
+
+# Installation
+
+## Best Practices
+
+According to your needs, we support two install modes:
+
+- [Install from source (Recommended)](#install-from-source): You want to develop your own network or new features based on MMPretrain framework. For example, adding new datasets or new backbones. And you can use all tools we provided.
+- [Install as a Python package](#install-as-a-python-package): You just want to call MMPretrain's APIs or import MMPretrain's modules in your project.
+
+### Install from source
+
+In this case, install mmpretrain from source:
+
+```shell
+git clone https://github.com/open-mmlab/mmpretrain.git
+cd mmpretrain
+pip install -U openmim && mim install -e .
+```
+
+```{note}
+`"-e"` means installing a project in editable mode, thus any local modifications made to the code will take effect without reinstallation.
+```
+
+### Install as a Python package
+
+Just install with mim.
+
+```shell
+pip install -U openmim && mim install "mmpretrain>=1.0.0rc8"
+```
+
+```{note}
+`mim` is a light-weight command-line tool to set up an appropriate environment for OpenMMLab repositories according to the PyTorch and CUDA versions. It also has some useful functions for deep-learning experiments.
+```
+
+## Install multi-modality support (Optional)
+
+The multi-modality models in MMPretrain require extra dependencies. To install these dependencies, you
+can add `[multimodal]` during the installation. For example:
+
+```shell
+# Install from source
+mim install -e ".[multimodal]"
+
+# Install as a Python package
+mim install "mmpretrain[multimodal]>=1.0.0rc8"
+```
+
+## Verify the installation
+
+To verify whether MMPretrain is installed correctly, we provide some sample code to run an inference demo.
+
+Option (a). If you install mmpretrain from the source, just run the following command:
+
+```shell
+python demo/image_demo.py demo/demo.JPEG resnet18_8xb32_in1k --device cpu
+```
+
+You will see the output result dict including `pred_label`, `pred_score` and `pred_class` in your terminal.
+
+Option (b). If you install mmpretrain as a Python package, open your Python interpreter and copy & paste the following code.
+
+```python
+from mmpretrain import get_model, inference_model
+
+model = get_model('resnet18_8xb32_in1k', device='cpu') # or device='cuda:0'
+inference_model(model, 'demo/demo.JPEG')
+```
+
+You will see a dict printed, including the predicted label, score and category name.
+
+```{note}
+The `resnet18_8xb32_in1k` is the model name, and you can use [`mmpretrain.list_models`](mmpretrain.apis.list_models) to
+explore all models, or search them on the [Model Zoo Summary](./modelzoo_statistics.md)
+```
+
+## Customize Installation
+
+### CUDA versions
+
+When installing PyTorch, you need to specify the version of CUDA. If you are
+not clear on which to choose, follow our recommendations:
+
+- For Ampere-based NVIDIA GPUs, such as GeForce 30 series and NVIDIA A100, CUDA 11 is a must.
+- For older NVIDIA GPUs, CUDA 11 is backward compatible, but CUDA 10.2 offers better compatibility and is more lightweight.
+
+Please make sure the GPU driver satisfies the minimum version requirements. See [this table](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-major-component-versions__table-cuda-toolkit-driver-versions) for more information.
+
+```{note}
+Installing CUDA runtime libraries is enough if you follow our best practices,
+because no CUDA code will be compiled locally. However if you hope to compile
+MMCV from source or develop other CUDA operators, you need to install the
+complete CUDA toolkit from NVIDIA's [website](https://developer.nvidia.com/cuda-downloads),
+and its version should match the CUDA version of PyTorch. i.e., the specified
+version of cudatoolkit in `conda install` command.
+```
+
+### Install on CPU-only platforms
+
+MMPretrain can be built for a CPU-only environment. In CPU mode, you can train, test or run inference with a model.
+
+### Install on Google Colab
+
+See [the Colab tutorial](https://colab.research.google.com/github/mzr1996/mmclassification-tutorial/blob/master/1.x/MMClassification_tools.ipynb).
+
+### Using MMPretrain with Docker
+
+We provide a [Dockerfile](https://github.com/open-mmlab/mmpretrain/blob/main/docker/Dockerfile)
+to build an image. Ensure that your [docker version](https://docs.docker.com/engine/install/) >=19.03.
+
+```shell
+# build an image with PyTorch 1.12.1, CUDA 11.3
+# If you prefer other versions, just modify the Dockerfile
+docker build -t mmpretrain docker/
+```
+
+Run it with
+
+```shell
+docker run --gpus all --shm-size=8g -it -v {DATA_DIR}:/mmpretrain/data mmpretrain
+```
+
+## Troubleshooting
+
+If you have some issues during the installation, please first view the [FAQ](./notes/faq.md) page.
+You may [open an issue](https://github.com/open-mmlab/mmpretrain/issues/new/choose)
+on GitHub if no solution is found.
diff --git a/docs/en/index.rst b/docs/en/index.rst
new file mode 100644
index 0000000000000000000000000000000000000000..d16a32d603d018eb209e9dca546b5e200c0fba25
--- /dev/null
+++ b/docs/en/index.rst
@@ -0,0 +1,157 @@
+Welcome to MMPretrain's documentation!
+============================================
+
+MMPretrain is a newly upgraded open-source framework for pre-training.
+It has set out to provide multiple powerful pre-trained backbones and
+support different pre-training strategies. MMPretrain originated from the
+famous open-source projects
+`MMClassification `_
+and `MMSelfSup `_, and is developed
+with many exciting new features. The pre-training stage is currently essential for
+vision recognition. With rich and strong pre-trained models,
+we are able to improve various downstream vision tasks.
+
+Our primary objective for the codebase is to become an easily accessible and
+user-friendly library and to streamline research and engineering. We
+detail the properties and design of MMPretrain across different sections.
+
+Hands-on Roadmap of MMPretrain
+-------------------------------
+
+To help users quickly utilize MMPretrain, we recommend following the hands-on
+roadmap we have created for the library:
+
+ - For users who want to try MMPretrain, we suggest reading the GetStarted_
+ section for the environment setup.
+
+ - For basic usage, we refer users to UserGuides_ for utilizing various
+ algorithms to obtain the pre-trained models and evaluate their performance
+ in downstream tasks.
+
+ - For those who wish to customize their own algorithms, we provide
+ AdvancedGuides_ that include hints and rules for modifying code.
+
+ - To find your desired pre-trained models, users could check the ModelZoo_,
+ which features a summary of various backbones and pre-training methods and
+   introduction of different algorithms.
+
+ - Additionally, we provide Analysis_ and Visualization_ tools to help
+ diagnose algorithms.
+
+ - Besides, if you have any other questions or concerns, please refer to the
+ Notes_ section for potential answers.
+
+We always welcome *PRs* and *Issues* for the betterment of MMPretrain.
+
+.. _GetStarted:
+.. toctree::
+ :maxdepth: 1
+ :caption: Get Started
+
+ get_started.md
+
+.. _UserGuides:
+.. toctree::
+ :maxdepth: 1
+ :caption: User Guides
+
+ user_guides/config.md
+ user_guides/dataset_prepare.md
+ user_guides/inference.md
+ user_guides/train.md
+ user_guides/test.md
+ user_guides/downstream.md
+
+.. _AdvancedGuides:
+.. toctree::
+ :maxdepth: 1
+ :caption: Advanced Guides
+
+ advanced_guides/datasets.md
+ advanced_guides/pipeline.md
+ advanced_guides/modules.md
+ advanced_guides/schedule.md
+ advanced_guides/runtime.md
+ advanced_guides/evaluation.md
+ advanced_guides/convention.md
+
+.. _ModelZoo:
+.. toctree::
+ :maxdepth: 1
+ :caption: Model Zoo
+ :glob:
+
+ modelzoo_statistics.md
+ papers/*
+
+.. _Visualization:
+.. toctree::
+ :maxdepth: 1
+ :caption: Visualization
+
+ useful_tools/dataset_visualization.md
+ useful_tools/scheduler_visualization.md
+ useful_tools/cam_visualization.md
+ useful_tools/t-sne_visualization.md
+
+.. _Analysis:
+.. toctree::
+ :maxdepth: 1
+ :caption: Analysis Tools
+
+ useful_tools/print_config.md
+ useful_tools/verify_dataset.md
+ useful_tools/log_result_analysis.md
+ useful_tools/complexity_analysis.md
+ useful_tools/confusion_matrix.md
+ useful_tools/shape_bias.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Deployment
+
+ useful_tools/model_serving.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Migration
+
+ migration.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: API Reference
+
+ mmpretrain.apis
+ mmpretrain.engine
+ mmpretrain.datasets
+ Data Process
+ mmpretrain.models
+ mmpretrain.structures
+ mmpretrain.visualization
+ mmpretrain.evaluation
+ mmpretrain.utils
+
+.. _Notes:
+.. toctree::
+ :maxdepth: 1
+ :caption: Notes
+
+ notes/contribution_guide.md
+ notes/projects.md
+ notes/changelog.md
+ notes/faq.md
+ notes/pretrain_custom_dataset.md
+ notes/finetune_custom_dataset.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Device Support
+
+ device/npu.md
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`search`
diff --git a/docs/en/migration.md b/docs/en/migration.md
new file mode 100644
index 0000000000000000000000000000000000000000..bdebdf6f5a9b454f94b5c66688f33d429544669e
--- /dev/null
+++ b/docs/en/migration.md
@@ -0,0 +1,772 @@
+# Migration
+
+We introduce some modifications in MMPretrain 1.x, and some of them are BC-breaking. To migrate your projects from **MMClassification 0.x** or **MMSelfSup 0.x** smoothly, please read this tutorial.
+
+- [Migration](#migration)
+ - [New dependencies](#new-dependencies)
+- [General change of config](#general-change-of-config)
+ - [Schedule settings](#schedule-settings)
+ - [Runtime settings](#runtime-settings)
+ - [Other changes](#other-changes)
+- [Migration from MMClassification 0.x](#migration-from-mmclassification-0x)
+ - [Config files](#config-files)
+ - [Model settings](#model-settings)
+ - [Data settings](#data-settings)
+ - [Packages](#packages)
+ - [`mmpretrain.apis`](#mmpretrainapis)
+ - [`mmpretrain.core`](#mmpretraincore)
+ - [`mmpretrain.datasets`](#mmpretraindatasets)
+ - [`mmpretrain.models`](#mmpretrainmodels)
+ - [`mmpretrain.utils`](#mmpretrainutils)
+- [Migration from MMSelfSup 0.x](#migration-from-mmselfsup-0x)
+ - [Config](#config)
+ - [Dataset settings](#dataset-settings)
+ - [Model settings](#model-settings-1)
+ - [Package](#package)
+
+## New dependencies
+
+```{warning}
+MMPretrain 1.x has new package dependencies, and a new environment should be created for MMPretrain 1.x even if you already have a working MMClassification 0.x or MMSelfSup 0.x environment. Please refer to the [installation tutorial](./get_started.md) for the required package installation or install the packages manually.
+```
+
+1. [MMEngine](https://github.com/open-mmlab/mmengine): MMEngine is the core of the OpenMMLab 2.0 architecture,
+   and we have split many components unrelated to computer vision from MMCV to MMEngine.
+2. [MMCV](https://github.com/open-mmlab/mmcv): The computer vision package of OpenMMLab. This is not a new
+ dependency, but it should be upgraded to version `2.0.0rc1` or above.
+3. [rich](https://github.com/Textualize/rich): A terminal formatting package, and we use it to enhance some
+ outputs in the terminal.
+4. [einops](https://github.com/arogozhnikov/einops): Operators for Einstein notations.
+
+# General change of config
+
+In this section, we introduce the general differences between the old versions (**MMClassification 0.x** or **MMSelfSup 0.x**) and **MMPretrain 1.x**.
+
+## Schedule settings
+
+| MMCls or MMSelfSup 0.x | MMPretrain 1.x | Remark |
+| ---------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------- |
+| optimizer_config | / | It has been **removed**. |
+| / | optim_wrapper | The `optim_wrapper` provides a common interface for updating parameters. |
+| lr_config | param_scheduler | The `param_scheduler` is a list to set learning rate or other parameters, which is more flexible. |
+| runner                 | train_cfg       | The loop setting (`EpochBasedTrainLoop`, `IterBasedTrainLoop`) in `train_cfg` controls the workflow of the algorithm training.    |
+
+Changes in **`optimizer`** and **`optimizer_config`**:
+
+- Now we use `optim_wrapper` field to specify all configurations related to optimization process. The
+ `optimizer` has become a subfield of `optim_wrapper`.
+- The `paramwise_cfg` field is also a subfield of `optim_wrapper`, instead of `optimizer`.
+- The `optimizer_config` field has been removed, and all configurations have been moved to `optim_wrapper`.
+- The `grad_clip` field has been renamed to `clip_grad`.
+
+**Original**
+
+```python
+optimizer = dict(
+ type='AdamW',
+ lr=0.0015,
+ weight_decay=0.3,
+ paramwise_cfg = dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ ))
+
+optimizer_config = dict(grad_clip=dict(max_norm=1.0))
+```
+
+**New**
+
+```python
+optim_wrapper = dict(
+ optimizer=dict(type='AdamW', lr=0.0015, weight_decay=0.3),
+ paramwise_cfg = dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ ),
+ clip_grad=dict(max_norm=1.0),
+)
+```
+
+Changes in **`lr_config`**:
+
+- The `lr_config` field has been removed and replaced by the new `param_scheduler`.
+- The `warmup` related arguments have also been removed since we use a combination of schedulers to implement this
+ functionality.
+
+The new scheduler combination mechanism is highly flexible and enables the design of various learning rate/momentum curves.
+For more details, see the {external+mmengine:doc}`parameter schedulers tutorial `.
+
+**Original**
+
+```python
+lr_config = dict(
+ policy='CosineAnnealing',
+ min_lr=0,
+ warmup='linear',
+ warmup_iters=5,
+ warmup_ratio=0.01,
+ warmup_by_epoch=True)
+```
+
+**New**
+
+```python
+param_scheduler = [
+ # warmup
+ dict(
+ type='LinearLR',
+ start_factor=0.01,
+ by_epoch=True,
+ end=5,
+        # Update the learning rate every iteration.
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', by_epoch=True, begin=5),
+]
+```
+
+Changes in **`runner`**:
+
+Most of the configurations that were originally in the `runner` field have been moved to `train_cfg`, `val_cfg`, and `test_cfg`.
+These fields are used to configure the loop for training, validation, and testing.
+
+**Original**
+
+```python
+runner = dict(type='EpochBasedRunner', max_epochs=100)
+```
+
+**New**
+
+```python
+# The `val_interval` is the original `evaluation.interval`.
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict() # Use the default validation loop.
+test_cfg = dict() # Use the default test loop.
+```
+
+In OpenMMLab 2.0, we introduced `Loop` to control the behaviors in training, validation and testing. As a result, the functionalities of `Runner` have also been changed.
+More details can be found in the {external+mmengine:doc}`MMEngine tutorials `.
+
+## Runtime settings
+
+Changes in **`checkpoint_config`** and **`log_config`**:
+
+The `checkpoint_config` has been moved to `default_hooks.checkpoint`, and `log_config` has been moved to
+`default_hooks.logger`. Additionally, many hook settings that were previously included in the script code have
+been moved to the `default_hooks` field in the runtime configuration.
+
+```python
+default_hooks = dict(
+    # record the time of every iteration.
+ timer=dict(type='IterTimerHook'),
+
+ # print log every 100 iterations.
+ logger=dict(type='LoggerHook', interval=100),
+
+ # enable the parameter scheduler.
+ param_scheduler=dict(type='ParamSchedulerHook'),
+
+ # save checkpoint per epoch, and automatically save the best checkpoint.
+ checkpoint=dict(type='CheckpointHook', interval=1, save_best='auto'),
+
+    # set sampler seed in the distributed environment.
+ sampler_seed=dict(type='DistSamplerSeedHook'),
+
+ # validation results visualization, set True to enable it.
+ visualization=dict(type='VisualizationHook', enable=False),
+)
+```
+
+In OpenMMLab 2.0, we have split the original logger into a logger and a visualizer. The logger is used to record
+information, while the visualizer is used to display the logged information in different backends such as the terminal,
+TensorBoard, and Wandb.
+
+**Original**
+
+```python
+log_config = dict(
+ interval=100,
+ hooks=[
+ dict(type='TextLoggerHook'),
+ dict(type='TensorboardLoggerHook'),
+ ])
+```
+
+**New**
+
+```python
+default_hooks = dict(
+ ...
+ logger=dict(type='LoggerHook', interval=100),
+)
+
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[dict(type='LocalVisBackend'), dict(type='TensorboardVisBackend')],
+)
+```
+
+Changes in **`load_from`** and **`resume_from`**:
+
+The `resume_from` field has been removed; we use `resume` and `load_from` together instead:
+
+- If `resume=True` and `load_from` is not None, training is resumed from the checkpoint in `load_from`.
+- If `resume=True` and `load_from` is None, the runner tries to resume from the latest checkpoint in the work directory.
+- If `resume=False` and `load_from` is not None, only the checkpoint is loaded, without resuming training.
+- If `resume=False` and `load_from` is None, neither loading nor resuming is performed.
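+
+For example, to resume an interrupted run from a specific checkpoint, the runtime config could look like the sketch below (the path is illustrative):
+
+```python
+# Load the weights from this checkpoint and also restore the training state
+# (epoch, optimizer and scheduler states) from it.
+load_from = 'work_dirs/my_exp/epoch_50.pth'
+resume = True
+```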
+
+Changes in **`dist_params`**: The `dist_params` field has become a subfield of `env_cfg` now.
+Additionally, some new configurations have been added to `env_cfg`.
+
+```python
+env_cfg = dict(
+ # whether to enable cudnn benchmark
+ cudnn_benchmark=False,
+
+ # set multi process parameters
+ mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
+
+ # set distributed parameters
+ dist_cfg=dict(backend='nccl'),
+)
+```
+
+Changes in **`workflow`**: `workflow` related functionalities are removed.
+
+New field **`visualizer`**: The visualizer is a new design in OpenMMLab 2.0 architecture. The runner uses an
+instance of the visualizer to handle result and log visualization, as well as to save to different backends.
+For more information, please refer to the {external+mmengine:doc}`MMEngine tutorial `.
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ # Uncomment the below line to save the log and visualization results to TensorBoard.
+ # dict(type='TensorboardVisBackend')
+ ]
+)
+```
+
+New field **`default_scope`**: The starting point to search modules in all registries. The `default_scope` in MMPretrain is `mmpretrain`. See {external+mmengine:doc}`the registry tutorial ` for more details.
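+
+A minimal sketch of how this looks in a config file:
+
+```python
+# Modules referred to by `type='...'` in this config are looked up in the
+# registries under the `mmpretrain` scope by default.
+default_scope = 'mmpretrain'
+```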
+
+## Other changes
+
+We moved the definition of all registries in different packages to the `mmpretrain.registry` package.
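+
+For example, a hypothetical custom module is now registered through that package (a sketch; `MyBackbone` is a made-up name):
+
+```python
+from torch import nn
+
+from mmpretrain.registry import MODELS
+
+
+@MODELS.register_module()
+class MyBackbone(nn.Module):
+    """A toy backbone registered into the unified MODELS registry."""
+
+    def forward(self, x):
+        return x
+```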
+
+# Migration from MMClassification 0.x
+
+## Config files
+
+In MMPretrain 1.x, we refactored the structure of configuration files, and the original files are not usable.
+
+In this section, we will introduce all changes of the configuration files. And we assume you already have
+ideas of the [config files](./user_guides/config.md).
+
+### Model settings
+
+No changes in `model.backbone`, `model.neck` and `model.head` fields.
+
+Changes in **`model.train_cfg`**:
+
+- `BatchMixup` is renamed to [`Mixup`](mmpretrain.models.utils.batch_augments.Mixup).
+- `BatchCutMix` is renamed to [`CutMix`](mmpretrain.models.utils.batch_augments.CutMix).
+- `BatchResizeMix` is renamed to [`ResizeMix`](mmpretrain.models.utils.batch_augments.ResizeMix).
+- The `prob` argument is removed from all augments settings. You can use the `probs` field in `train_cfg` to
+  specify the probability of each augmentation. If the `probs` field is not set, one augmentation is chosen
+  randomly with equal probability (see the sketch after the comparison below).
+
+**Original**
+
+```python
+model = dict(
+ ...
+ train_cfg=dict(augments=[
+ dict(type='BatchMixup', alpha=0.8, num_classes=1000, prob=0.5),
+ dict(type='BatchCutMix', alpha=1.0, num_classes=1000, prob=0.5)
+    ]),
+)
+```
+
+**New**
+
+```python
+model = dict(
+ ...
+    train_cfg=dict(augments=[
+        dict(type='Mixup', alpha=0.8),
+        dict(type='CutMix', alpha=1.0),
+    ]),
+)
+```
+
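+If you want to keep explicit probabilities for each augmentation, a rough sketch in the new style could be (the values are illustrative):
+
+```python
+model = dict(
+    type='ImageClassifier',
+    # backbone, neck and head are unchanged and omitted here.
+    train_cfg=dict(
+        augments=[dict(type='Mixup', alpha=0.8), dict(type='CutMix', alpha=1.0)],
+        # Apply Mixup 30% of the time and CutMix 70% of the time.
+        probs=[0.3, 0.7],
+    ),
+)
+```
+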
+### Data settings
+
+Changes in **`data`**:
+
+- The original `data` field is split into `train_dataloader`, `val_dataloader` and
+  `test_dataloader`. This allows us to configure them in a fine-grained manner. For example,
+  you can specify different samplers and batch sizes during training and testing.
+- The `samples_per_gpu` is renamed to `batch_size`.
+- The `workers_per_gpu` is renamed to `num_workers`.
+
+**Original**
+
+```python
+data = dict(
+ samples_per_gpu=32,
+ workers_per_gpu=2,
+ train=dict(...),
+ val=dict(...),
+ test=dict(...),
+)
+```
+
+**New**
+
+```python
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=2,
+ dataset=dict(...),
+ sampler=dict(type='DefaultSampler', shuffle=True) # necessary
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=2,
+ dataset=dict(...),
+ sampler=dict(type='DefaultSampler', shuffle=False) # necessary
+)
+
+test_dataloader = val_dataloader
+```
+
+Changes in **`pipeline`**:
+
+- The original formatting transforms **`ToTensor`**, **`ImageToTensor`** and **`Collect`** are combined as [`PackInputs`](mmpretrain.datasets.transforms.PackInputs).
+- We no longer recommend performing **`Normalize`** in the dataset pipeline. Please remove it from pipelines and set it in the `data_preprocessor` field instead.
+- The argument `flip_prob` in [**`RandomFlip`**](mmcv.transforms.RandomFlip) is renamed to `prob`.
+- The argument `size` in [**`RandomCrop`**](mmpretrain.datasets.transforms.RandomCrop) is renamed to `crop_size`.
+- The argument `size` in [**`RandomResizedCrop`**](mmpretrain.datasets.transforms.RandomResizedCrop) is renamed to `scale`.
+- The argument `size` in [**`Resize`**](mmcv.transforms.Resize) is renamed to `scale`. `Resize` no longer supports sizes like `(256, -1)`; please use [`ResizeEdge`](mmpretrain.datasets.transforms.ResizeEdge) instead.
+- The argument `policies` in [**`AutoAugment`**](mmpretrain.datasets.transforms.AutoAugment) and [**`RandAugment`**](mmpretrain.datasets.transforms.RandAugment) supports using string to specify preset policies. `AutoAugment` supports "imagenet" and `RandAugment` supports "timm_increasing".
+- **`RandomResizedCrop`** and **`CenterCrop`** no longer support `efficientnet_style`; please use [`EfficientNetRandomCrop`](mmpretrain.datasets.transforms.EfficientNetRandomCrop) and [`EfficientNetCenterCrop`](mmpretrain.datasets.transforms.EfficientNetCenterCrop) instead.
+
+```{note}
+We have moved some of the data transform work, such as normalization, to the data preprocessor; see
+[the documentation](mmpretrain.models.utils.data_preprocessor) for more details.
+```
+
+**Original**
+
+```python
+img_norm_cfg = dict(
+ mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', size=224),
+ dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='ToTensor', keys=['gt_label']),
+ dict(type='Collect', keys=['img', 'gt_label'])
+]
+```
+
+**New**
+
+```python
+data_preprocessor = dict(
+ mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+```
+
+Changes in **`evaluation`**:
+
+- The **`evaluation`** field is split into `val_evaluator` and `test_evaluator`, and it no longer supports the `interval` and `save_best` arguments.
+  The `interval` is moved to `train_cfg.val_interval`, see [the schedule settings](./user_guides/config.md#schedule-settings), and the `save_best`
+  is moved to `default_hooks.checkpoint.save_best`, see [the runtime settings](./user_guides/config.md#runtime-settings).
+- The 'accuracy' metric is renamed to [`Accuracy`](mmpretrain.evaluation.Accuracy).
+- The 'precision', 'recall', 'f1-score' and 'support' metrics are combined into [`SingleLabelMetric`](mmpretrain.evaluation.SingleLabelMetric); use the `items` argument to specify which metrics to calculate.
+- The 'mAP' is renamed to [`AveragePrecision`](mmpretrain.evaluation.AveragePrecision).
+- The 'CP', 'CR', 'CF1', 'OP', 'OR' and 'OF1' metrics are combined into [`MultiLabelMetric`](mmpretrain.evaluation.MultiLabelMetric); use the `items` and `average` arguments to specify which metrics to calculate.
+
+**Original**
+
+```python
+evaluation = dict(
+ interval=1,
+ metric='accuracy',
+ metric_options=dict(topk=(1, 5))
+)
+```
+
+**New**
+
+```python
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+test_evaluator = val_evaluator
+```
+
+**Original**
+
+```python
+evaluation = dict(
+ interval=1,
+ metric=['mAP', 'CP', 'OP', 'CR', 'OR', 'CF1', 'OF1'],
+ metric_options=dict(thr=0.5),
+)
+```
+
+**New**
+
+```python
+val_evaluator = [
+ dict(type='AveragePrecision'),
+ dict(type='MultiLabelMetric',
+ items=['precision', 'recall', 'f1-score'],
+ average='both',
+ thr=0.5),
+]
+test_evaluator = val_evaluator
+```
+
+## Packages
+
+### `mmpretrain.apis`
+
+The documentation can be found [here](mmpretrain.apis).
+
+| Function | Changes |
+| :------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------- |
+| `init_model` | No changes |
+| `inference_model`    | No changes, but we recommend using [`mmpretrain.ImageClassificationInferencer`](mmpretrain.apis.ImageClassificationInferencer) instead.           |
+| `train_model` | Removed, use `runner.train` to train. |
+| `multi_gpu_test` | Removed, use `runner.test` to test. |
+| `single_gpu_test` | Removed, use `runner.test` to test. |
+| `show_result_pyplot` | Removed, use [`mmpretrain.ImageClassificationInferencer`](mmpretrain.apis.ImageClassificationInferencer) to run inference and show the result.    |
+| `set_random_seed` | Removed, use `mmengine.runner.set_random_seed`. |
+| `init_random_seed` | Removed, use `mmengine.dist.sync_random_seed`. |
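+
+As a quick reference, a minimal usage sketch of the inferencer might look like this (the model name and image path are placeholders):
+
+```python
+from mmpretrain import ImageClassificationInferencer
+
+# Build the inferencer from a model name in the model zoo
+# (a config/checkpoint pair also works).
+inferencer = ImageClassificationInferencer('resnet50_8xb32_in1k')
+# The inferencer returns a list of result dicts, one per input image.
+result = inferencer('demo/demo.JPEG')[0]
+print(result['pred_class'], result['pred_score'])
+```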
+
+### `mmpretrain.core`
+
+The `mmpretrain.core` package is renamed to [`mmpretrain.engine`](mmpretrain.engine).
+
+| Sub package | Changes |
+| :-------------: | :-------------------------------------------------------------------------------------------------------------------------------- |
+| `evaluation` | Removed, use the metrics in [`mmpretrain.evaluation`](mmpretrain.evaluation). |
+| `hook` | Moved to [`mmpretrain.engine.hooks`](mmpretrain.engine.hooks) |
+| `optimizers` | Moved to [`mmpretrain.engine.optimizers`](mmpretrain.engine.optimizers) |
+| `utils` | Removed, the distributed environment related functions can be found in the [`mmengine.dist`](api/dist) package. |
+| `visualization` | Removed, the related functionalities are implemented in [`mmengine.visualization.Visualizer`](mmengine.visualization.Visualizer). |
+
+The `MMClsWandbHook` in the `hooks` package has not been implemented yet.
+
+The `CosineAnnealingCooldownLrUpdaterHook` in the `hooks` package has been removed; this functionality is now supported by
+a combination of parameter schedulers, see [the tutorial](./advanced_guides/schedule.md).
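+
+For instance, a rough sketch of a cosine schedule followed by a constant cool-down phase (the epoch boundaries and factor are illustrative, not an official recipe):
+
+```python
+param_scheduler = [
+    # Cosine annealing for the first 95 epochs.
+    dict(type='CosineAnnealingLR', T_max=95, by_epoch=True, begin=0, end=95),
+    # Hold a small constant learning rate for the last 5 epochs as a cool-down.
+    dict(type='ConstantLR', factor=0.1, by_epoch=True, begin=95, end=100),
+]
+```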
+
+### `mmpretrain.datasets`
+
+The documentation can be found [here](mmpretrain.datasets).
+
+| Dataset class | Changes |
+| :---------------------------------------------------------------------------------------: | :-------------------------------------------------------------------------------------------------------------- |
+| [`CustomDataset`](mmpretrain.datasets.CustomDataset)                                       | Adds a `data_root` argument as the common prefix of `data_prefix` and `ann_file`, and supports loading unlabeled data. |
+| [`ImageNet`](mmpretrain.datasets.ImageNet) | Same as `CustomDataset`. |
+| [`ImageNet21k`](mmpretrain.datasets.ImageNet21k) | Same as `CustomDataset`. |
+| [`CIFAR10`](mmpretrain.datasets.CIFAR10) & [`CIFAR100`](mmpretrain.datasets.CIFAR100) | The `test_mode` argument is a required argument now. |
+| [`MNIST`](mmpretrain.datasets.MNIST) & [`FashionMNIST`](mmpretrain.datasets.FashionMNIST) | The `test_mode` argument is a required argument now. |
+| [`VOC`](mmpretrain.datasets.VOC) | Requires `data_root`, `image_set_path` and `test_mode` now. |
+| [`CUB`](mmpretrain.datasets.CUB) | Requires `data_root` and `test_mode` now. |
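+
+As a rough example of the new arguments, a `CustomDataset` could be configured like below (the directory layout and file names are hypothetical):
+
+```python
+train_dataloader = dict(
+    batch_size=32,
+    num_workers=4,
+    sampler=dict(type='DefaultSampler', shuffle=True),
+    dataset=dict(
+        type='CustomDataset',
+        data_root='data/my_dataset',  # common prefix of the paths below
+        ann_file='meta/train.txt',    # relative to data_root
+        data_prefix='train/',         # relative to data_root
+        pipeline=[...],               # your training pipeline
+    ),
+)
+```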
+
+The `mmpretrain.datasets.pipelines` is renamed to `mmpretrain.datasets.transforms`.
+
+| Transform class | Changes |
+| :-----------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `LoadImageFromFile` | Removed, use [`mmcv.transforms.LoadImageFromFile`](mmcv.transforms.LoadImageFromFile). |
+| `RandomFlip` | Removed, use [`mmcv.transforms.RandomFlip`](mmcv.transforms.RandomFlip). The argument `flip_prob` is renamed to `prob`. |
+| `RandomCrop` | The argument `size` is renamed to `crop_size`. |
+| `RandomResizedCrop` | The argument `size` is renamed to `scale`. The argument `scale` is renamed to `crop_ratio_range`. Won't support `efficientnet_style`, use [`EfficientNetRandomCrop`](mmpretrain.datasets.transforms.EfficientNetRandomCrop). |
+| `CenterCrop` | Removed, use [`mmcv.transforms.CenterCrop`](mmcv.transforms.CenterCrop). Won't support `efficientnet_style`, use [`EfficientNetCenterCrop`](mmpretrain.datasets.transforms.EfficientNetCenterCrop). |
+| `Resize` | Removed, use [`mmcv.transforms.Resize`](mmcv.transforms.Resize). The argument `size` is renamed to `scale`. Won't support size like `(256, -1)`, use [`ResizeEdge`](mmpretrain.datasets.transforms.ResizeEdge). |
+| `AutoAugment` & `RandAugment`   | The argument `policies` supports using a string to specify preset policies.  |
+| `Compose` | Removed, use [`mmcv.transforms.Compose`](mmcv.transforms.Compose). |
+
+### `mmpretrain.models`
+
+The documentation can be found [here](mmpretrain.models). The interfaces of all **backbones**, **necks** and **losses** are unchanged.
+
+Changes in [`ImageClassifier`](mmpretrain.models.classifiers.ImageClassifier):
+
+| Method of classifiers | Changes |
+| :-------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `extract_feat` | No changes |
+| `forward` | Now only accepts three arguments: `inputs`, `data_samples` and `mode`. See [the documentation](mmpretrain.models.classifiers.ImageClassifier.forward) for more details. |
+| `forward_train` | Replaced by `loss`. |
+| `simple_test` | Replaced by `predict`. |
+| `train_step` | The `optimizer` argument is replaced by `optim_wrapper` and it accepts [`OptimWrapper`](mmengine.optim.OptimWrapper). |
+| `val_step`            | The original `val_step` was the same as `train_step`; now it calls `predict`.    |
+| `test_step` | New method, and it's the same as `val_step`. |
+
+Changes in [heads](mmpretrain.models.heads):
+
+| Method of heads | Changes |
+| :-------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `pre_logits` | No changes |
+| `forward_train` | Replaced by `loss`. |
+| `simple_test` | Replaced by `predict`. |
+| `loss`          | It accepts `data_samples` instead of `gt_labels` to calculate the loss. The `data_samples` argument is a list of [`DataSample`](mmpretrain.structures.DataSample).     |
+| `forward`       | New method, and it returns the output of the classification head without any post-processing such as softmax or sigmoid. |
+
+### `mmpretrain.utils`
+
+| Function | Changes |
+| :--------------------------: | :-------------------------------------------------------------------------------------------------------------- |
+| `collect_env` | No changes |
+| `get_root_logger` | Removed, use [`mmengine.logging.MMLogger.get_current_instance`](mmengine.logging.MMLogger.get_current_instance) |
+| `load_json_log` | The output format changed. |
+| `setup_multi_processes` | Removed, use [`mmengine.utils.dl_utils.set_multi_processing`](mmengine.utils.dl_utils.set_multi_processing). |
+| `wrap_non_distributed_model` | Removed, we auto wrap the model in the runner. |
+| `wrap_distributed_model` | Removed, we auto wrap the model in the runner. |
+| `auto_select_device` | Removed, we auto select the device in the runner. |
+
+# Migration from MMSelfSup 0.x
+
+## Config
+
+This section illustrates the changes to our config files in the `_base_` folder, which includes three parts:
+
+- Datasets: `configs/_base_/datasets`
+- Models: `configs/_base_/models`
+- Schedules: `configs/_base_/schedules`
+
+### Dataset settings
+
+In **MMSelfSup 0.x**, we use key `data` to summarize all information, such as `samples_per_gpu`, `train`, `val`, etc.
+
+In **MMPretrain 1.x**, we use separate `train_dataloader` and `val_dataloader` fields to summarize the corresponding information, and the key `data` has been **removed**.
+
+**Original**
+
+```python
+data = dict(
+ samples_per_gpu=32, # total 32*8(gpu)=256
+ workers_per_gpu=4,
+ train=dict(
+ type=dataset_type,
+ data_source=dict(
+ type=data_source,
+ data_prefix='data/imagenet/train',
+ ann_file='data/imagenet/meta/train.txt',
+ ),
+ num_views=[1, 1],
+ pipelines=[train_pipeline1, train_pipeline2],
+ prefetch=prefetch,
+ ),
+ val=...)
+```
+
+**New**
+
+```python
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=4,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
+val_dataloader = ...
+```
+
+Besides, we have **removed** the `data_source` key to keep the pipeline format consistent with that in other OpenMMLab projects. Please refer to [Config](user_guides/config.md) for more details.
+
+Changes in **`pipeline`**:
+
+Take MAE as an example of `pipeline`:
+
+```python
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.2, 1.0),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(type='PackInputs')
+]
+```
+
+### Model settings
+
+In the config of models, there are two main differences from MMSelfSup 0.x.
+
+1. There is a new key called `data_preprocessor`, which is responsible for preprocessing the data, like normalization, channel conversion, etc. For example:
+
+```python
+# The data preprocessor can be defined at the top level of the config ...
+data_preprocessor = dict(
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True)
+model = dict(
+ type='MAE',
+    # ... or directly inside the model config.
+    data_preprocessor=dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ bgr_to_rgb=True),
+ backbone=...,
+ neck=...,
+ head=...,
+ init_cfg=...)
+```
+
+2. There is a new key `loss` in `head` in MMPretrain 1.x, to determine the loss function of the algorithm. For example:
+
+```python
+model = dict(
+ type='MAE',
+ backbone=...,
+ neck=...,
+ head=dict(
+ type='MAEPretrainHead',
+ norm_pix=True,
+ patch_size=16,
+ loss=dict(type='MAEReconstructionLoss')),
+ init_cfg=...)
+```
+
+## Package
+
+The table below records the general modification of the folders and files.
+
+| MMSelfSup 0.x | MMPretrain 1.x | Remark |
+| ------------------------ | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| apis | apis | The high level APIs are updated. |
+| core                     | engine              | The `core` folder has been renamed to `engine`, which includes `hooks` and `optimizers`. ([API link](mmpretrain.engine))                                        |
+| datasets                 | datasets            | The datasets are implemented for different benchmarks, such as ImageNet and Places205. ([API link](mmpretrain.datasets))                                        |
+| datasets/data_sources    | /                   | The `data_sources` has been **removed**, and the `datasets` directory is now consistent with other OpenMMLab projects.                                          |
+| datasets/pipelines | datasets/transforms | The `pipelines` folder has been renamed to `transforms`. ([API link](mmpretrain.datasets.transforms)) |
+| /                        | evaluation          | The `evaluation` folder is created for evaluation functions and classes. ([API link](mmpretrain.evaluation))                                                    |
+| models/algorithms | selfsup | The algorithms are moved to `selfsup` folder. ([API link](mmpretrain.models.selfsup)) |
+| models/backbones | selfsup | The re-implemented backbones are moved to corresponding self-supervised learning algorithm `.py` files. ([API link](mmpretrain.models.selfsup)) |
+| models/target_generators | selfsup | The target generators are moved to corresponding self-supervised learning algorithm `.py` files. ([API link](mmpretrain.models.selfsup)) |
+| / | models/losses | The `losses` folder is created to provide different loss implementations, which is from `heads`. ([API link](mmpretrain.models.losses)) |
+| / | structures | The `structures` folder is for the implementation of data structures. In MMPretrain, we implement a new data structure, `DataSample`, to pass and receive data throughout the training/val process. ([API link](mmpretrain.structures)) |
+| / | visualization | The `visualization` folder contains the visualizer, which is responsible for some visualization tasks like visualizing data augmentation. ([API link](mmpretrain.visualization)) |
diff --git a/docs/en/notes/changelog.md b/docs/en/notes/changelog.md
new file mode 100644
index 0000000000000000000000000000000000000000..499ed24f64941e6731aeaa16fe492307ef4e0e4f
--- /dev/null
+++ b/docs/en/notes/changelog.md
@@ -0,0 +1,1055 @@
+# Changelog (MMPreTrain)
+
+## v1.2.0(04/01/2024)
+
+### New Features
+
+- [Feature] Support LLaVA 1.5 ([#1853](https://github.com/open-mmlab/mmpretrain/pull/1853))
+- [Feature] Implement of RAM with a gradio interface. ([#1802](https://github.com/open-mmlab/mmpretrain/pull/1802))
+
+### Bug Fix
+
+- [Fix] Fix resize mix argument bug.
+
+## v1.1.0(12/10/2023)
+
+### New Features
+
+- [Feature] Implement of Zero-Shot CLIP Classifier ([#1737](https://github.com/open-mmlab/mmpretrain/pull/1737))
+- [Feature] Add minigpt4 gradio demo and training script. ([#1758](https://github.com/open-mmlab/mmpretrain/pull/1758))
+
+### Improvements
+
+- [Config] New Version of config Adapting MobileNet Algorithm ([#1774](https://github.com/open-mmlab/mmpretrain/pull/1774))
+- [Config] Support DINO self-supervised learning in project ([#1756](https://github.com/open-mmlab/mmpretrain/pull/1756))
+- [Config] New Version of config Adapting Swin Transformer Algorithm ([#1780](https://github.com/open-mmlab/mmpretrain/pull/1780))
+- [Enhance] Add iTPN Supports for Non-three channel image ([#1735](https://github.com/open-mmlab/mmpretrain/pull/1735))
+- [Docs] Update dataset download script from opendatalab to openXlab ([#1765](https://github.com/open-mmlab/mmpretrain/pull/1765))
+- [Docs] Update COCO-Retrieval dataset docs. ([#1806](https://github.com/open-mmlab/mmpretrain/pull/1806))
+
+### Bug Fix
+
+- Update `train.py` to be compatible with the new config.
+- Update the OFA module to be compatible with the latest HuggingFace.
+- Fix pipeline bug in ImageRetrievalInferencer.
+
+## v1.0.2(15/08/2023)
+
+### New Features
+
+- Add MFF ([#1725](https://github.com/open-mmlab/mmpretrain/pull/1725))
+- Support training of BLIP2 ([#1700](https://github.com/open-mmlab/mmpretrain/pull/1700))
+
+### Improvements
+
+- New Version of config Adapting MAE Algorithm ([#1750](https://github.com/open-mmlab/mmpretrain/pull/1750))
+- New Version of config Adapting ConvNeXt Algorithm ([#1760](https://github.com/open-mmlab/mmpretrain/pull/1760))
+- New version of config adapting BeitV2 Algorithm ([#1755](https://github.com/open-mmlab/mmpretrain/pull/1755))
+- Update `dataset_prepare.md` ([#1732](https://github.com/open-mmlab/mmpretrain/pull/1732))
+- New Version of `config` Adapting Vision Transformer Algorithm ([#1727](https://github.com/open-mmlab/mmpretrain/pull/1727))
+- Support Infographic VQA dataset and ANLS metric. ([#1667](https://github.com/open-mmlab/mmpretrain/pull/1667))
+- Support IconQA dataset. ([#1670](https://github.com/open-mmlab/mmpretrain/pull/1670))
+- Fix typo MIMHIVIT to MAEHiViT ([#1749](https://github.com/open-mmlab/mmpretrain/pull/1749))
+
+## v1.0.1(28/07/2023)
+
+### Improvements
+
+- Add init_cfg with type='pretrained' to downstream tasks ([#1717](https://github.com/open-mmlab/mmpretrain/pull/1717))
+- Set 'is_init' in some multimodal methods ([#1718](https://github.com/open-mmlab/mmpretrain/pull/1718))
+- Adapt test cases on Ascend NPU ([#1728](https://github.com/open-mmlab/mmpretrain/pull/1728))
+- Add GPU acceleration on Apple silicon Mac ([#1699](https://github.com/open-mmlab/mmpretrain/pull/1699))
+- BEiT refactor ([#1705](https://github.com/open-mmlab/mmpretrain/pull/1705))
+
+### Bug Fixes
+
+- Fix dict update in minigpt4. ([#1709](https://github.com/open-mmlab/mmpretrain/pull/1709))
+- Fix nested predict for multi-task prediction ([#1716](https://github.com/open-mmlab/mmpretrain/pull/1716))
+- Fix the issue #1711 "GaussianBlur doesn't work" ([#1722](https://github.com/open-mmlab/mmpretrain/pull/1722))
+- Correct a typo of 'target' ([#1655](https://github.com/open-mmlab/mmpretrain/pull/1655))
+- Fix freeze without cls_token in vit ([#1693](https://github.com/open-mmlab/mmpretrain/pull/1693))
+- Fix RandomCrop bug ([#1706](https://github.com/open-mmlab/mmpretrain/pull/1706))
+
+### Docs Update
+
+- Fix spelling ([#1689](https://github.com/open-mmlab/mmpretrain/pull/1689))
+
+## v1.0.0(04/07/2023)
+
+### Highlights
+
+- Support inference of more **multi-modal** algorithms, such as **LLaVA**, **MiniGPT-4**, **Otter**, etc.
+- Support around **10 multi-modal datasets**!
+- Add **iTPN**, **SparK** self-supervised learning algorithms.
+- Provide examples of [New Config](https://github.com/open-mmlab/mmpretrain/tree/main/mmpretrain/configs/) and [DeepSpeed/FSDP](https://github.com/open-mmlab/mmpretrain/tree/main/configs/mae/benchmarks/).
+
+### New Features
+
+- Transfer shape-bias tool from mmselfsup ([#1658](https://github.com/open-mmlab/mmpretrain/pull/1685))
+- Download dataset by using MIM&OpenDataLab ([#1630](https://github.com/open-mmlab/mmpretrain/pull/1630))
+- Support New Configs ([#1639](https://github.com/open-mmlab/mmpretrain/pull/1639), [#1647](https://github.com/open-mmlab/mmpretrain/pull/1647), [#1665](https://github.com/open-mmlab/mmpretrain/pull/1665))
+- Support Flickr30k Retrieval dataset ([#1625](https://github.com/open-mmlab/mmpretrain/pull/1625))
+- Support SparK ([#1531](https://github.com/open-mmlab/mmpretrain/pull/1531))
+- Support LLaVA ([#1652](https://github.com/open-mmlab/mmpretrain/pull/1652))
+- Support Otter ([#1651](https://github.com/open-mmlab/mmpretrain/pull/1651))
+- Support MiniGPT-4 ([#1642](https://github.com/open-mmlab/mmpretrain/pull/1642))
+- Add support for VizWiz dataset ([#1636](https://github.com/open-mmlab/mmpretrain/pull/1636))
+- Add support for vsr dataset ([#1634](https://github.com/open-mmlab/mmpretrain/pull/1634))
+- Add InternImage Classification project ([#1569](https://github.com/open-mmlab/mmpretrain/pull/1569))
+- Support OCR-VQA dataset ([#1621](https://github.com/open-mmlab/mmpretrain/pull/1621))
+- Support OK-VQA dataset ([#1615](https://github.com/open-mmlab/mmpretrain/pull/1615))
+- Support TextVQA dataset ([#1569](https://github.com/open-mmlab/mmpretrain/pull/1569))
+- Support iTPN and HiViT ([#1584](https://github.com/open-mmlab/mmpretrain/pull/1584))
+- Add retrieval mAP metric ([#1552](https://github.com/open-mmlab/mmpretrain/pull/1552))
+- Support NoCap dataset based on BLIP. ([#1582](https://github.com/open-mmlab/mmpretrain/pull/1582))
+- Add GQA dataset ([#1585](https://github.com/open-mmlab/mmpretrain/pull/1585))
+
+### Improvements
+
+- Update fsdp vit-huge and vit-large config ([#1675](https://github.com/open-mmlab/mmpretrain/pull/1675))
+- Support deepspeed with flexible runner ([#1673](https://github.com/open-mmlab/mmpretrain/pull/1673))
+- Update Otter and LLaVA docs and config. ([#1653](https://github.com/open-mmlab/mmpretrain/pull/1653))
+- Add image_only param of ScienceQA ([#1613](https://github.com/open-mmlab/mmpretrain/pull/1613))
+- Support to use "split" to specify training set/validation ([#1535](https://github.com/open-mmlab/mmpretrain/pull/1535))
+
+### Bug Fixes
+
+- Refactor \_prepare_pos_embed in ViT ([#1656](https://github.com/open-mmlab/mmpretrain/pull/1656), [#1679](https://github.com/open-mmlab/mmpretrain/pull/1679))
+- Freeze pre norm in vision transformer ([#1672](https://github.com/open-mmlab/mmpretrain/pull/1672))
+- Fix bug loading IN1k dataset ([#1641](https://github.com/open-mmlab/mmpretrain/pull/1641))
+- Fix sam bug ([#1633](https://github.com/open-mmlab/mmpretrain/pull/1633))
+- Fixed circular import error for new transform ([#1609](https://github.com/open-mmlab/mmpretrain/pull/1609))
+- Update torchvision transform wrapper ([#1595](https://github.com/open-mmlab/mmpretrain/pull/1595))
+- Set default out_type in CAM visualization ([#1586](https://github.com/open-mmlab/mmpretrain/pull/1586))
+
+### Docs Update
+
+- Fix spelling ([#1681](https://github.com/open-mmlab/mmpretrain/pull/1681))
+- Fix doc typos ([#1671](https://github.com/open-mmlab/mmpretrain/pull/1671), [#1644](https://github.com/open-mmlab/mmpretrain/pull/1644), [#1629](https://github.com/open-mmlab/mmpretrain/pull/1629))
+- Add t-SNE visualization doc ([#1555](https://github.com/open-mmlab/mmpretrain/pull/1555))
+
+## v1.0.0rc8(22/05/2023)
+
+### Highlights
+
+- Support multiple multi-modal algorithms and inferencers. You can explore these features by the [gradio demo](https://github.com/open-mmlab/mmpretrain/tree/main/projects/gradio_demo)!
+- Add EVA-02, Dino-V2, ViT-SAM and GLIP backbones.
+- Register torchvision transforms into MMPretrain; you can now easily integrate torchvision's data augmentations in MMPretrain.
+
+### New Features
+
+- Support Chinese CLIP. ([#1576](https://github.com/open-mmlab/mmpretrain/pull/1576))
+- Add ScienceQA Metrics ([#1577](https://github.com/open-mmlab/mmpretrain/pull/1577))
+- Support multiple multi-modal algorithms and inferencers. ([#1561](https://github.com/open-mmlab/mmpretrain/pull/1561))
+- add eva02 backbone ([#1450](https://github.com/open-mmlab/mmpretrain/pull/1450))
+- Support dinov2 backbone ([#1522](https://github.com/open-mmlab/mmpretrain/pull/1522))
+- Support some downstream classification datasets. ([#1467](https://github.com/open-mmlab/mmpretrain/pull/1467))
+- Support GLIP ([#1308](https://github.com/open-mmlab/mmpretrain/pull/1308))
+- Register torchvision transforms into mmpretrain ([#1265](https://github.com/open-mmlab/mmpretrain/pull/1265))
+- Add ViT of SAM ([#1476](https://github.com/open-mmlab/mmpretrain/pull/1476))
+
+### Improvements
+
+- [Refactor] Support to freeze channel reduction and add layer decay function ([#1490](https://github.com/open-mmlab/mmpretrain/pull/1490))
+- [Refactor] Support resizing pos_embed while loading ckpt and format output ([#1488](https://github.com/open-mmlab/mmpretrain/pull/1488))
+
+### Bug Fixes
+
+- Fix scienceqa ([#1581](https://github.com/open-mmlab/mmpretrain/pull/1581))
+- Fix config of beit ([#1528](https://github.com/open-mmlab/mmpretrain/pull/1528))
+- Incorrect stage freeze on RIFormer Model ([#1573](https://github.com/open-mmlab/mmpretrain/pull/1573))
+- Fix ddp bugs caused by `out_type`. ([#1570](https://github.com/open-mmlab/mmpretrain/pull/1570))
+- Fix multi-task-head loss potential bug ([#1530](https://github.com/open-mmlab/mmpretrain/pull/1530))
+- Support bce loss without batch augmentations ([#1525](https://github.com/open-mmlab/mmpretrain/pull/1525))
+- Fix clip generator init bug ([#1518](https://github.com/open-mmlab/mmpretrain/pull/1518))
+- Fix the bug in binary cross entropy loss ([#1499](https://github.com/open-mmlab/mmpretrain/pull/1499))
+
+### Docs Update
+
+- Update PoolFormer citation to CVPR version ([#1505](https://github.com/open-mmlab/mmpretrain/pull/1505))
+- Refine Inference Doc ([#1489](https://github.com/open-mmlab/mmpretrain/pull/1489))
+- Add doc for usage of confusion matrix ([#1513](https://github.com/open-mmlab/mmpretrain/pull/1513))
+- Update MMagic link ([#1517](https://github.com/open-mmlab/mmpretrain/pull/1517))
+- Fix example_project README ([#1575](https://github.com/open-mmlab/mmpretrain/pull/1575))
+- Add NPU support page ([#1481](https://github.com/open-mmlab/mmpretrain/pull/1481))
+- train cfg: Removed old description ([#1473](https://github.com/open-mmlab/mmpretrain/pull/1473))
+- Fix typo in MultiLabelDataset docstring ([#1483](https://github.com/open-mmlab/mmpretrain/pull/1483))
+
+## v1.0.0rc7(07/04/2023)
+
+### Highlights
+
+- Integrated Self-supervised learning algorithms from **MMSelfSup**, such as **MAE**, **BEiT**, etc.
+- Support **RIFormer**, a simple but effective vision backbone by removing token mixer.
+- Support **LeViT**, **XCiT**, **ViG** and **ConvNeXt-V2** backbone.
+- Add t-SNE visualization.
+- Refactor dataset pipeline visualization.
+- Support confusion matrix calculation and plot.
+
+### New Features
+
+- Support RIFormer. ([#1453](https://github.com/open-mmlab/mmpretrain/pull/1453))
+- Support XCiT Backbone. ([#1305](https://github.com/open-mmlab/mmclassification/pull/1305))
+- Support calculate confusion matrix and plot it. ([#1287](https://github.com/open-mmlab/mmclassification/pull/1287))
+- Support RetrieverRecall metric & Add ArcFace config ([#1316](https://github.com/open-mmlab/mmclassification/pull/1316))
+- Add `ImageClassificationInferencer`. ([#1261](https://github.com/open-mmlab/mmclassification/pull/1261))
+- Support InShop Dataset (Image Retrieval). ([#1019](https://github.com/open-mmlab/mmclassification/pull/1019))
+- Support LeViT backbone. ([#1238](https://github.com/open-mmlab/mmclassification/pull/1238))
+- Support VIG Backbone. ([#1304](https://github.com/open-mmlab/mmclassification/pull/1304))
+- Support ConvNeXt-V2 backbone. ([#1294](https://github.com/open-mmlab/mmclassification/pull/1294))
+
+### Improvements
+
+- Use PyTorch official `scaled_dot_product_attention` to accelerate `MultiheadAttention`. ([#1434](https://github.com/open-mmlab/mmpretrain/pull/1434))
+- Add ln to vit avg_featmap output ([#1447](https://github.com/open-mmlab/mmpretrain/pull/1447))
+- Update analysis tools and documentations. ([#1359](https://github.com/open-mmlab/mmclassification/pull/1359))
+- Unify the `--out` and `--dump` in `tools/test.py`. ([#1307](https://github.com/open-mmlab/mmclassification/pull/1307))
+- Enable to toggle whether Gem Pooling is trainable or not. ([#1246](https://github.com/open-mmlab/mmclassification/pull/1246))
+- Update registries of mmcls. ([#1306](https://github.com/open-mmlab/mmclassification/pull/1306))
+- Add metafile fill and validation tools. ([#1297](https://github.com/open-mmlab/mmclassification/pull/1297))
+- Remove useless EfficientnetV2 config files. ([#1300](https://github.com/open-mmlab/mmclassification/pull/1300))
+
+### Bug Fixes
+
+- Fix precise bn hook ([#1466](https://github.com/open-mmlab/mmpretrain/pull/1466))
+- Fix retrieval multi gpu bug ([#1319](https://github.com/open-mmlab/mmclassification/pull/1319))
+- Fix error repvgg-deploy base config path. ([#1357](https://github.com/open-mmlab/mmclassification/pull/1357))
+- Fix bug in test tools. ([#1309](https://github.com/open-mmlab/mmclassification/pull/1309))
+
+### Docs Update
+
+- Translate some tools tutorials to Chinese. ([#1321](https://github.com/open-mmlab/mmclassification/pull/1321))
+- Add Chinese translation for runtime.md. ([#1313](https://github.com/open-mmlab/mmclassification/pull/1313))
+
+# Changelog (MMClassification)
+
+## v1.0.0rc5(30/12/2022)
+
+### Highlights
+
+- Support EVA, RevViT, EfficientnetV2, CLIP, TinyViT and MixMIM backbones.
+- Reproduce the training accuracy of ConvNeXt and RepVGG.
+- Support multi-task training and testing.
+- Support Test-time Augmentation.
+
+### New Features
+
+- [Feature] Add EfficientnetV2 Backbone. ([#1253](https://github.com/open-mmlab/mmclassification/pull/1253))
+- [Feature] Support TTA and add `--tta` in `tools/test.py`. ([#1161](https://github.com/open-mmlab/mmclassification/pull/1161))
+- [Feature] Support Multi-task. ([#1229](https://github.com/open-mmlab/mmclassification/pull/1229))
+- [Feature] Add clip backbone. ([#1258](https://github.com/open-mmlab/mmclassification/pull/1258))
+- [Feature] Add mixmim backbone with checkpoints. ([#1224](https://github.com/open-mmlab/mmclassification/pull/1224))
+- [Feature] Add TinyViT for dev-1.x. ([#1042](https://github.com/open-mmlab/mmclassification/pull/1042))
+- [Feature] Add some scripts for development. ([#1257](https://github.com/open-mmlab/mmclassification/pull/1257))
+- [Feature] Support EVA. ([#1239](https://github.com/open-mmlab/mmclassification/pull/1239))
+- [Feature] Implementation of RevViT. ([#1127](https://github.com/open-mmlab/mmclassification/pull/1127))
+
+### Improvements
+
+- [Reproduce] Reproduce RepVGG Training Accuracy. ([#1264](https://github.com/open-mmlab/mmclassification/pull/1264))
+- [Enhance] Support ConvNeXt More Weights. ([#1240](https://github.com/open-mmlab/mmclassification/pull/1240))
+- [Reproduce] Update ConvNeXt config files. ([#1256](https://github.com/open-mmlab/mmclassification/pull/1256))
+- [CI] Update CI to test PyTorch 1.13.0. ([#1260](https://github.com/open-mmlab/mmclassification/pull/1260))
+- [Project] Add ACCV workshop 1st Solution. ([#1245](https://github.com/open-mmlab/mmclassification/pull/1245))
+- [Project] Add Example project. ([#1254](https://github.com/open-mmlab/mmclassification/pull/1254))
+
+### Bug Fixes
+
+- [Fix] Fix imports in transforms. ([#1255](https://github.com/open-mmlab/mmclassification/pull/1255))
+- [Fix] Fix CAM visualization. ([#1248](https://github.com/open-mmlab/mmclassification/pull/1248))
+- [Fix] Fix the requirements and lazy register mmpretrain models. ([#1275](https://github.com/open-mmlab/mmclassification/pull/1275))
+
+## v1.0.0rc4(06/12/2022)
+
+### Highlights
+
+- Upgrade API to get pre-defined models of MMClassification. See [#1236](https://github.com/open-mmlab/mmclassification/pull/1236) for more details.
+- Refactor BEiT backbone and support v1/v2 inference. See [#1144](https://github.com/open-mmlab/mmclassification/pull/1144).
+
+### New Features
+
+- Support getting model from the name defined in the model-index file. ([#1236](https://github.com/open-mmlab/mmclassification/pull/1236))
+
+### Improvements
+
+- Support evaluate on both EMA and non-EMA models. ([#1204](https://github.com/open-mmlab/mmclassification/pull/1204))
+- Refactor BEiT backbone and support v1/v2 inference. ([#1144](https://github.com/open-mmlab/mmclassification/pull/1144))
+
+### Bug Fixes
+
+- Fix `reparameterize_model.py` doesn't save meta info. ([#1221](https://github.com/open-mmlab/mmclassification/pull/1221))
+- Fix dict update in BEiT. ([#1234](https://github.com/open-mmlab/mmclassification/pull/1234))
+
+### Docs Update
+
+- Update install tutorial. ([#1223](https://github.com/open-mmlab/mmclassification/pull/1223))
+- Update MobileNetv2 & MobileNetv3 readme. ([#1222](https://github.com/open-mmlab/mmclassification/pull/1222))
+- Add version selection in the banner. ([#1217](https://github.com/open-mmlab/mmclassification/pull/1217))
+
+## v1.0.0rc3(21/11/2022)
+
+### Highlights
+
+- Add **Switch Recipe** Hook, Now we can modify training pipeline, mixup and loss settings during training, see [#1101](https://github.com/open-mmlab/mmclassification/pull/1101).
+- Add **TIMM and HuggingFace** wrappers. Now you can train/use models in TIMM/HuggingFace directly, see [#1102](https://github.com/open-mmlab/mmclassification/pull/1102).
+- Support **retrieval tasks**, see [#1055](https://github.com/open-mmlab/mmclassification/pull/1055).
+- Reproduce **mobileone** training accuracy. See [#1191](https://github.com/open-mmlab/mmclassification/pull/1191)
+
+### New Features
+
+- Add checkpoints from EfficientNets NoisyStudent & L2. ([#1122](https://github.com/open-mmlab/mmclassification/pull/1122))
+- Migrate CSRA head to 1.x. ([#1177](https://github.com/open-mmlab/mmclassification/pull/1177))
+- Support RepLKnet backbone. ([#1129](https://github.com/open-mmlab/mmclassification/pull/1129))
+- Add Switch Recipe Hook. ([#1101](https://github.com/open-mmlab/mmclassification/pull/1101))
+- Add adan optimizer. ([#1180](https://github.com/open-mmlab/mmclassification/pull/1180))
+- Support DaViT. ([#1105](https://github.com/open-mmlab/mmclassification/pull/1105))
+- Support Activation Checkpointing for ConvNeXt. ([#1153](https://github.com/open-mmlab/mmclassification/pull/1153))
+- Add TIMM and HuggingFace wrappers to build classifiers from them directly. ([#1102](https://github.com/open-mmlab/mmclassification/pull/1102))
+- Add reduction for neck ([#978](https://github.com/open-mmlab/mmclassification/pull/978))
+- Support HorNet Backbone for dev1.x. ([#1094](https://github.com/open-mmlab/mmclassification/pull/1094))
+- Add arcface head. ([#926](https://github.com/open-mmlab/mmclassification/pull/926))
+- Add Base Retriever and Image2Image Retriever for retrieval tasks. ([#1055](https://github.com/open-mmlab/mmclassification/pull/1055))
+- Support MobileViT backbone. ([#1068](https://github.com/open-mmlab/mmclassification/pull/1068))
+
+### Improvements
+
+- [Enhance] Enhance ArcFaceClsHead. ([#1181](https://github.com/open-mmlab/mmclassification/pull/1181))
+- [Refactor] Refactor to use new fileio API in MMEngine. ([#1176](https://github.com/open-mmlab/mmclassification/pull/1176))
+- [Enhance] Reproduce mobileone training accuracy. ([#1191](https://github.com/open-mmlab/mmclassification/pull/1191))
+- [Enhance] add deleting params info in swinv2. ([#1142](https://github.com/open-mmlab/mmclassification/pull/1142))
+- [Enhance] Add more mobilenetv3 pretrains. ([#1154](https://github.com/open-mmlab/mmclassification/pull/1154))
+- [Enhancement] RepVGG for YOLOX-PAI for dev-1.x. ([#1126](https://github.com/open-mmlab/mmclassification/pull/1126))
+- [Improve] Speed up data preprocessor. ([#1064](https://github.com/open-mmlab/mmclassification/pull/1064))
+
+### Bug Fixes
+
+- Fix the torchserve. ([#1143](https://github.com/open-mmlab/mmclassification/pull/1143))
+- Fix configs due to api refactor of `num_classes`. ([#1184](https://github.com/open-mmlab/mmclassification/pull/1184))
+- Update mmpretrain2torchserve. ([#1189](https://github.com/open-mmlab/mmclassification/pull/1189))
+- Fix for `inference_model` cannot get classes information in checkpoint. ([#1093](https://github.com/open-mmlab/mmclassification/pull/1093))
+
+### Docs Update
+
+- Add not-found page extension. ([#1207](https://github.com/open-mmlab/mmclassification/pull/1207))
+- update visualization doc. ([#1160](https://github.com/open-mmlab/mmclassification/pull/1160))
+- Support sort and search the Model Summary table. ([#1100](https://github.com/open-mmlab/mmclassification/pull/1100))
+- Improve the ResNet model page. ([#1118](https://github.com/open-mmlab/mmclassification/pull/1118))
+- update the readme of convnext. ([#1156](https://github.com/open-mmlab/mmclassification/pull/1156))
+- Fix the installation docs link in README. ([#1164](https://github.com/open-mmlab/mmclassification/pull/1164))
+- Improve ViT and MobileViT model pages. ([#1155](https://github.com/open-mmlab/mmclassification/pull/1155))
+- Improve Swin Doc and Add Tabs extension. ([#1145](https://github.com/open-mmlab/mmclassification/pull/1145))
+- Add MMEval projects link in README. ([#1162](https://github.com/open-mmlab/mmclassification/pull/1162))
+- Add runtime configuration docs. ([#1128](https://github.com/open-mmlab/mmclassification/pull/1128))
+- Add custom evaluation docs ([#1130](https://github.com/open-mmlab/mmclassification/pull/1130))
+- Add custom pipeline docs. ([#1124](https://github.com/open-mmlab/mmclassification/pull/1124))
+- Add MMYOLO projects link in MMCLS1.x. ([#1117](https://github.com/open-mmlab/mmclassification/pull/1117))
+
+## v1.0.0rc2(12/10/2022)
+
+### New Features
+
+- [Feature] Support DeiT3. ([#1065](https://github.com/open-mmlab/mmclassification/pull/1065))
+
+### Improvements
+
+- [Enhance] Update `analyze_results.py` for dev-1.x. ([#1071](https://github.com/open-mmlab/mmclassification/pull/1071))
+- [Enhance] Get scores from inference api. ([#1070](https://github.com/open-mmlab/mmclassification/pull/1070))
+
+### Bug Fixes
+
+- [Fix] Update requirements. ([#1083](https://github.com/open-mmlab/mmclassification/pull/1083))
+
+### Docs Update
+
+- [Docs] Add 1x docs schedule. ([#1015](https://github.com/open-mmlab/mmclassification/pull/1015))
+
+## v1.0.0rc1(30/9/2022)
+
+### New Features
+
+- Support MViT for MMCLS 1.x ([#1023](https://github.com/open-mmlab/mmclassification/pull/1023))
+- Add ViT huge architecture. ([#1049](https://github.com/open-mmlab/mmclassification/pull/1049))
+- Support EdgeNeXt for dev-1.x. ([#1037](https://github.com/open-mmlab/mmclassification/pull/1037))
+- Support Swin Transformer V2 for MMCLS 1.x. ([#1029](https://github.com/open-mmlab/mmclassification/pull/1029))
+- Add efficientformer Backbone for MMCls 1.x. ([#1031](https://github.com/open-mmlab/mmclassification/pull/1031))
+- Add MobileOne Backbone For MMCls 1.x. ([#1030](https://github.com/open-mmlab/mmclassification/pull/1030))
+- Support BEiT Transformer layer. ([#919](https://github.com/open-mmlab/mmclassification/pull/919))
+
+### Improvements
+
+- [Refactor] Fix visualization tools. ([#1045](https://github.com/open-mmlab/mmclassification/pull/1045))
+- [Improve] Update benchmark scripts ([#1028](https://github.com/open-mmlab/mmclassification/pull/1028))
+- [Improve] Update tools to enable `pin_memory` and `persistent_workers` by default. ([#1024](https://github.com/open-mmlab/mmclassification/pull/1024))
+- [CI] Update circle-ci and github workflow. ([#1018](https://github.com/open-mmlab/mmclassification/pull/1018))
+
+### Bug Fixes
+
+- Fix verify dataset tool in 1.x. ([#1062](https://github.com/open-mmlab/mmclassification/pull/1062))
+- Fix `loss_weight` in `LabelSmoothLoss`. ([#1058](https://github.com/open-mmlab/mmclassification/pull/1058))
+- Fix the output position of Swin-Transformer. ([#947](https://github.com/open-mmlab/mmclassification/pull/947))
+
+### Docs Update
+
+- Auto generate model summary table. ([#1010](https://github.com/open-mmlab/mmclassification/pull/1010))
+- Refactor new modules tutorial. ([#998](https://github.com/open-mmlab/mmclassification/pull/998))
+
+## v1.0.0rc0(31/8/2022)
+
+MMClassification 1.0.0rc0 is the first version of MMClassification 1.x, a part of the OpenMMLab 2.0 projects.
+
+Built upon the new [training engine](https://github.com/open-mmlab/mmengine), MMClassification 1.x unifies the interfaces of dataset, models, evaluation, and visualization.
+
+And there are some BC-breaking changes. Please check [the migration tutorial](https://mmclassification.readthedocs.io/en/1.x/migration.html) for more details.
+
+## v0.23.1(2/6/2022)
+
+### New Features
+
+- Dedicated MMClsWandbHook for MMClassification (Weights and Biases Integration) ([#764](https://github.com/open-mmlab/mmclassification/pull/764))
+
+### Improvements
+
+- Use mdformat instead of markdownlint to format markdown. ([#844](https://github.com/open-mmlab/mmclassification/pull/844))
+
+### Bug Fixes
+
+- Fix wrong `--local_rank`.
+
+### Docs Update
+
+- Update install tutorials. ([#854](https://github.com/open-mmlab/mmclassification/pull/854))
+- Fix wrong link in README. ([#835](https://github.com/open-mmlab/mmclassification/pull/835))
+
+## v0.23.0(1/5/2022)
+
+### New Features
+
+- Support DenseNet. ([#750](https://github.com/open-mmlab/mmclassification/pull/750))
+- Support VAN. ([#739](https://github.com/open-mmlab/mmclassification/pull/739))
+
+### Improvements
+
+- Support training on IPU and add fine-tuning configs of ViT. ([#723](https://github.com/open-mmlab/mmclassification/pull/723))
+
+### Docs Update
+
+- New style API reference, and easier to use! Welcome [view it](https://mmclassification.readthedocs.io/en/master/api/models.html). ([#774](https://github.com/open-mmlab/mmclassification/pull/774))
+
+## v0.22.1(15/4/2022)
+
+### New Features
+
+- [Feature] Support resize relative position embedding in `SwinTransformer`. ([#749](https://github.com/open-mmlab/mmclassification/pull/749))
+- [Feature] Add PoolFormer backbone and checkpoints. ([#746](https://github.com/open-mmlab/mmclassification/pull/746))
+
+### Improvements
+
+- [Enhance] Improve CPE performance by reduce memory copy. ([#762](https://github.com/open-mmlab/mmclassification/pull/762))
+- [Enhance] Add extra dataloader settings in configs. ([#752](https://github.com/open-mmlab/mmclassification/pull/752))
+
+## v0.22.0(30/3/2022)
+
+### Highlights
+
+- Support a series of CSP Network, such as CSP-ResNet, CSP-ResNeXt and CSP-DarkNet.
+- A new `CustomDataset` class to help you build your own dataset!
+- Support ConvMixer, RepMLP and new dataset - CUB dataset.
+
+### New Features
+
+- [Feature] Add CSPNet backbone and checkpoints. ([#735](https://github.com/open-mmlab/mmclassification/pull/735))
+- [Feature] Add `CustomDataset`. ([#738](https://github.com/open-mmlab/mmclassification/pull/738))
+- [Feature] Add diff seeds to diff ranks. ([#744](https://github.com/open-mmlab/mmclassification/pull/744))
+- [Feature] Support ConvMixer. ([#716](https://github.com/open-mmlab/mmclassification/pull/716))
+- [Feature] Our `dist_train` & `dist_test` tools support distributed training on multiple machines. ([#734](https://github.com/open-mmlab/mmclassification/pull/734))
+- [Feature] Add RepMLP backbone and checkpoints. ([#709](https://github.com/open-mmlab/mmclassification/pull/709))
+- [Feature] Support CUB dataset. ([#703](https://github.com/open-mmlab/mmclassification/pull/703))
+- [Feature] Support ResizeMix. ([#676](https://github.com/open-mmlab/mmclassification/pull/676))
+
+### Improvements
+
+- [Enhance] Use `--a-b` instead of `--a_b` in arguments. ([#754](https://github.com/open-mmlab/mmclassification/pull/754))
+- [Enhance] Add `get_cat_ids` and `get_gt_labels` to KFoldDataset. ([#721](https://github.com/open-mmlab/mmclassification/pull/721))
+- [Enhance] Set torch seed in `worker_init_fn`. ([#733](https://github.com/open-mmlab/mmclassification/pull/733))
+
+### Bug Fixes
+
+- [Fix] Fix the discontiguous output feature map of ConvNeXt. ([#743](https://github.com/open-mmlab/mmclassification/pull/743))
+
+### Docs Update
+
+- [Docs] Add brief installation steps in README for copy&paste. ([#755](https://github.com/open-mmlab/mmclassification/pull/755))
+- [Docs] Fix logo URL link from mmocr to mmpretrain. ([#732](https://github.com/open-mmlab/mmclassification/pull/732))
+
+## v0.21.0(04/03/2022)
+
+### Highlights
+
+- Support ResNetV1c and Wide-ResNet, and provide pre-trained models.
+- Support dynamic input shape for ViT-based algorithms. Now our ViT, DeiT, Swin-Transformer and T2T-ViT support forwarding with any input shape.
+- Reproduce training results of DeiT. And our DeiT-T and DeiT-S have higher accuracy compared with the official weights.
+
+### New Features
+
+- Add ResNetV1c. ([#692](https://github.com/open-mmlab/mmclassification/pull/692))
+- Support Wide-ResNet. ([#715](https://github.com/open-mmlab/mmclassification/pull/715))
+- Support gem pooling ([#677](https://github.com/open-mmlab/mmclassification/pull/677))
+
+### Improvements
+
+- Reproduce training results of DeiT. ([#711](https://github.com/open-mmlab/mmclassification/pull/711))
+- Add ConvNeXt pretrain models on ImageNet-1k. ([#707](https://github.com/open-mmlab/mmclassification/pull/707))
+- Support dynamic input shape for ViT-based algorithms. ([#706](https://github.com/open-mmlab/mmclassification/pull/706))
+- Add `evaluate` function for ConcatDataset. ([#650](https://github.com/open-mmlab/mmclassification/pull/650))
+- Enhance vis-pipeline tool. ([#604](https://github.com/open-mmlab/mmclassification/pull/604))
+- Return code 1 if scripts runs failed. ([#694](https://github.com/open-mmlab/mmclassification/pull/694))
+- Use PyTorch official `one_hot` to implement `convert_to_one_hot`. ([#696](https://github.com/open-mmlab/mmclassification/pull/696))
+- Add a new pre-commit-hook to automatically add a copyright. ([#710](https://github.com/open-mmlab/mmclassification/pull/710))
+- Add deprecation message for deploy tools. ([#697](https://github.com/open-mmlab/mmclassification/pull/697))
+- Upgrade isort pre-commit hooks. ([#687](https://github.com/open-mmlab/mmclassification/pull/687))
+- Use `--gpu-id` instead of `--gpu-ids` in non-distributed multi-gpu training/testing. ([#688](https://github.com/open-mmlab/mmclassification/pull/688))
+- Remove deprecation. ([#633](https://github.com/open-mmlab/mmclassification/pull/633))
+
+### Bug Fixes
+
+- Fix Conformer forward with irregular input size. ([#686](https://github.com/open-mmlab/mmclassification/pull/686))
+- Add `dist.barrier` to fix a bug in directory checking. ([#666](https://github.com/open-mmlab/mmclassification/pull/666))
+
+## v0.20.1(07/02/2022)
+
+### Bug Fixes
+
+- Fix the MMCV dependency version.
+
+## v0.20.0(30/01/2022)
+
+### Highlights
+
+- Support K-fold cross-validation. The tutorial will be released later.
+- Support HRNet, ConvNeXt, Twins and EfficientNet.
+- Provide a tool to convert models from PyTorch to Core-ML.
+
+### New Features
+
+- Support K-fold cross-validation. ([#563](https://github.com/open-mmlab/mmclassification/pull/563))
+- Support HRNet and add pre-trained models. ([#660](https://github.com/open-mmlab/mmclassification/pull/660))
+- Support ConvNeXt and add pre-trained models. ([#670](https://github.com/open-mmlab/mmclassification/pull/670))
+- Support Twins and add pre-trained models. ([#642](https://github.com/open-mmlab/mmclassification/pull/642))
+- Support EfficientNet and add pre-trained models. ([#649](https://github.com/open-mmlab/mmclassification/pull/649))
+- Support `features_only` option in `TIMMBackbone`. ([#668](https://github.com/open-mmlab/mmclassification/pull/668))
+- Add a conversion script from PyTorch to Core-ML models. ([#597](https://github.com/open-mmlab/mmclassification/pull/597))
+
+### Improvements
+
+- New-style CPU training and inference. ([#674](https://github.com/open-mmlab/mmclassification/pull/674))
+- Set up multi-processing in both training and testing. ([#671](https://github.com/open-mmlab/mmclassification/pull/671))
+- Rewrite channel split operation in ShufflenetV2. ([#632](https://github.com/open-mmlab/mmclassification/pull/632))
+- Deprecate the support for "python setup.py test". ([#646](https://github.com/open-mmlab/mmclassification/pull/646))
+- Support single-label, softmax and custom eps options in asymmetric loss. ([#609](https://github.com/open-mmlab/mmclassification/pull/609))
+- Save class names in best checkpoint created by evaluation hook. ([#641](https://github.com/open-mmlab/mmclassification/pull/641))
+
+### Bug Fixes
+
+- Fix potential unexpected behaviors if `metric_options` is not specified in multi-label evaluation. ([#647](https://github.com/open-mmlab/mmclassification/pull/647))
+- Fix API changes in `pytorch-grad-cam>=1.3.7`. ([#656](https://github.com/open-mmlab/mmclassification/pull/656))
+- Fix bug which breaks `cal_train_time` in `analyze_logs.py`. ([#662](https://github.com/open-mmlab/mmclassification/pull/662))
+
+### Docs Update
+
+- Update README in configs according to OpenMMLab standard. ([#672](https://github.com/open-mmlab/mmclassification/pull/672))
+- Update installation guide and README. ([#624](https://github.com/open-mmlab/mmclassification/pull/624))
+
+## v0.19.0(31/12/2021)
+
+### Highlights
+
+- The feature extraction function has been enhanced. See [#593](https://github.com/open-mmlab/mmclassification/pull/593) for more details.
+- Provide the high-acc ResNet-50 training settings from [*ResNet strikes back*](https://arxiv.org/abs/2110.00476).
+- Reproduce the training accuracy of T2T-ViT & RegNetX, and provide self-trained checkpoints.
+- Support DeiT & Conformer backbone and checkpoints.
+- Provide a CAM visualization tool based on [pytorch-grad-cam](https://github.com/jacobgil/pytorch-grad-cam), and detailed [user guide](https://mmclassification.readthedocs.io/en/latest/tools/visualization.html#class-activation-map-visualization)!
+
+### New Features
+
+- Support Precise BN. ([#401](https://github.com/open-mmlab/mmclassification/pull/401))
+- Add CAM visualization tool. ([#577](https://github.com/open-mmlab/mmclassification/pull/577))
+- Repeated Aug and Sampler Registry. ([#588](https://github.com/open-mmlab/mmclassification/pull/588))
+- Add DeiT backbone and checkpoints. ([#576](https://github.com/open-mmlab/mmclassification/pull/576))
+- Support LAMB optimizer. ([#591](https://github.com/open-mmlab/mmclassification/pull/591))
+- Implement the conformer backbone. ([#494](https://github.com/open-mmlab/mmclassification/pull/494))
+- Add the frozen function for Swin Transformer model. ([#574](https://github.com/open-mmlab/mmclassification/pull/574))
+- Support using checkpoint in Swin Transformer to save memory. ([#557](https://github.com/open-mmlab/mmclassification/pull/557))
+
+### Improvements
+
+- [Reproduction] Reproduce RegNetX training accuracy. ([#587](https://github.com/open-mmlab/mmclassification/pull/587))
+- [Reproduction] Reproduce training results of T2T-ViT. ([#610](https://github.com/open-mmlab/mmclassification/pull/610))
+- [Enhance] Provide high-acc training settings of ResNet. ([#572](https://github.com/open-mmlab/mmclassification/pull/572))
+- [Enhance] Set a random seed when the user does not set a seed. ([#554](https://github.com/open-mmlab/mmclassification/pull/554))
+- [Enhance] Added `NumClassCheckHook` and unit tests. ([#559](https://github.com/open-mmlab/mmclassification/pull/559))
+- [Enhance] Enhance feature extraction function. ([#593](https://github.com/open-mmlab/mmclassification/pull/593))
+- [Enhance] Improve efficiency of precision, recall, f1_score and support. ([#595](https://github.com/open-mmlab/mmclassification/pull/595))
+- [Enhance] Improve accuracy calculation performance. ([#592](https://github.com/open-mmlab/mmclassification/pull/592))
+- [Refactor] Refactor `analysis_log.py`. ([#529](https://github.com/open-mmlab/mmclassification/pull/529))
+- [Refactor] Use new API of matplotlib to handle blocking input in visualization. ([#568](https://github.com/open-mmlab/mmclassification/pull/568))
+- [CI] Cancel previous runs that are not completed. ([#583](https://github.com/open-mmlab/mmclassification/pull/583))
+- [CI] Skip build CI if only configs or docs are modified. ([#575](https://github.com/open-mmlab/mmclassification/pull/575))
+
+### Bug Fixes
+
+- Fix test sampler bug. ([#611](https://github.com/open-mmlab/mmclassification/pull/611))
+- Try to create a symbolic link, otherwise copy. ([#580](https://github.com/open-mmlab/mmclassification/pull/580))
+- Fix a bug for multiple outputs in Swin Transformer. ([#571](https://github.com/open-mmlab/mmclassification/pull/571))
+
+### Docs Update
+
+- Update mmcv, torch, cuda version in Dockerfile and docs. ([#594](https://github.com/open-mmlab/mmclassification/pull/594))
+- Add analysis&misc docs. ([#525](https://github.com/open-mmlab/mmclassification/pull/525))
+- Fix docs build dependency. ([#584](https://github.com/open-mmlab/mmclassification/pull/584))
+
+## v0.18.0(30/11/2021)
+
+### Highlights
+
+- Support MLP-Mixer backbone and provide pre-trained checkpoints.
+- Add a tool to visualize the learning rate curve of the training phase. Welcome to use with the [tutorial](https://mmclassification.readthedocs.io/en/latest/tools/visualization.html#learning-rate-schedule-visualization)!
+
+### New Features
+
+- Add MLP Mixer Backbone. ([#528](https://github.com/open-mmlab/mmclassification/pull/528), [#539](https://github.com/open-mmlab/mmclassification/pull/539))
+- Support positive weights in BCE. ([#516](https://github.com/open-mmlab/mmclassification/pull/516))
+- Add a tool to visualize the learning rate in each iteration. ([#498](https://github.com/open-mmlab/mmclassification/pull/498))
+
+### Improvements
+
+- Use CircleCI to do unit tests. ([#567](https://github.com/open-mmlab/mmclassification/pull/567))
+- Support focal loss for single-label tasks. ([#548](https://github.com/open-mmlab/mmclassification/pull/548))
+- Remove useless `import_modules_from_string`. ([#544](https://github.com/open-mmlab/mmclassification/pull/544))
+- Rename config files according to the config name standard. ([#508](https://github.com/open-mmlab/mmclassification/pull/508))
+- Use `reset_classifier` to remove head of timm backbones. ([#534](https://github.com/open-mmlab/mmclassification/pull/534))
+- Support passing arguments to loss from head. ([#523](https://github.com/open-mmlab/mmclassification/pull/523))
+- Refactor `Resize` transform and add `Pad` transform. ([#506](https://github.com/open-mmlab/mmclassification/pull/506))
+- Update mmcv dependency version. ([#509](https://github.com/open-mmlab/mmclassification/pull/509))
+
+### Bug Fixes
+
+- Fix bug when using `ClassBalancedDataset`. ([#555](https://github.com/open-mmlab/mmclassification/pull/555))
+- Fix a bug when using iter-based runner with 'val' workflow. ([#542](https://github.com/open-mmlab/mmclassification/pull/542))
+- Fix interpolation method checking in `Resize`. ([#547](https://github.com/open-mmlab/mmclassification/pull/547))
+- Fix a bug when loading checkpoints in a multi-GPU environment. ([#527](https://github.com/open-mmlab/mmclassification/pull/527))
+- Fix an error on indexing scalar metrics in `analyze_result.py`. ([#518](https://github.com/open-mmlab/mmclassification/pull/518))
+- Fix wrong condition judgment in `analyze_logs.py` and prevent empty curve. ([#510](https://github.com/open-mmlab/mmclassification/pull/510))
+
+### Docs Update
+
+- Fix vit config and model broken links. ([#564](https://github.com/open-mmlab/mmclassification/pull/564))
+- Add abstract and image for every paper. ([#546](https://github.com/open-mmlab/mmclassification/pull/546))
+- Add mmflow and mim in banner and readme. ([#543](https://github.com/open-mmlab/mmclassification/pull/543))
+- Add schedule and runtime tutorial docs. ([#499](https://github.com/open-mmlab/mmclassification/pull/499))
+- Add the top-5 acc in ResNet-CIFAR README. ([#531](https://github.com/open-mmlab/mmclassification/pull/531))
+- Fix TOC of `visualization.md` and add example images. ([#513](https://github.com/open-mmlab/mmclassification/pull/513))
+- Use docs link of other projects and add MMCV docs. ([#511](https://github.com/open-mmlab/mmclassification/pull/511))
+
+## v0.17.0(29/10/2021)
+
+### Highlights
+
+- Support Tokens-to-Token ViT backbone and Res2Net backbone. Welcome to use!
+- Support ImageNet21k dataset.
+- Add a pipeline visualization tool. Try it with the [tutorials](https://mmclassification.readthedocs.io/en/latest/tools/visualization.html#pipeline-visualization)!
+
+### New Features
+
+- Add Tokens-to-Token ViT backbone and converted checkpoints. ([#467](https://github.com/open-mmlab/mmclassification/pull/467))
+- Add Res2Net backbone and converted weights. ([#465](https://github.com/open-mmlab/mmclassification/pull/465))
+- Support ImageNet21k dataset. ([#461](https://github.com/open-mmlab/mmclassification/pull/461))
+- Support seesaw loss. ([#500](https://github.com/open-mmlab/mmclassification/pull/500))
+- Add a pipeline visualization tool. ([#406](https://github.com/open-mmlab/mmclassification/pull/406))
+- Add a tool to find broken files. ([#482](https://github.com/open-mmlab/mmclassification/pull/482))
+- Add a tool to test TorchServe. ([#468](https://github.com/open-mmlab/mmclassification/pull/468))
+
+### Improvements
+
+- Refactor Vision Transformer. ([#395](https://github.com/open-mmlab/mmclassification/pull/395))
+- Use context manager to reuse matplotlib figures. ([#432](https://github.com/open-mmlab/mmclassification/pull/432))
+
+### Bug Fixes
+
+- Remove `DistSamplerSeedHook` if use `IterBasedRunner`. ([#501](https://github.com/open-mmlab/mmclassification/pull/501))
+- Set the priority of `EvalHook` to "LOW" to avoid a bug when using `IterBasedRunner`. ([#488](https://github.com/open-mmlab/mmclassification/pull/488))
+- Fix a wrong parameter of `get_root_logger` in `apis/train.py`. ([#486](https://github.com/open-mmlab/mmclassification/pull/486))
+- Fix version check in dataset builder. ([#474](https://github.com/open-mmlab/mmclassification/pull/474))
+
+### Docs Update
+
+- Add English Colab tutorials and update Chinese Colab tutorials. ([#483](https://github.com/open-mmlab/mmclassification/pull/483), [#497](https://github.com/open-mmlab/mmclassification/pull/497))
+- Add a tutorial for config files. ([#487](https://github.com/open-mmlab/mmclassification/pull/487))
+- Add model-pages in Model Zoo. ([#480](https://github.com/open-mmlab/mmclassification/pull/480))
+- Add code-spell pre-commit hook and fix a large amount of typos. ([#470](https://github.com/open-mmlab/mmclassification/pull/470))
+
+## v0.16.0(30/9/2021)
+
+### Highlights
+
+- We have improved compatibility with downstream repositories like MMDetection and MMSegmentation. We will add some examples of how to use our backbones in MMDetection.
+- Add RepVGG backbone and checkpoints. Welcome to use it!
+- Add timm backbones wrapper, now you can simply use backbones of pytorch-image-models in MMClassification!
+
+### New Features
+
+- Add RepVGG backbone and checkpoints. ([#414](https://github.com/open-mmlab/mmclassification/pull/414))
+- Add timm backbones wrapper. ([#427](https://github.com/open-mmlab/mmclassification/pull/427))
+
+### Improvements
+
+- Fix TnT compatibility and verbose warning. ([#436](https://github.com/open-mmlab/mmclassification/pull/436))
+- Support setting `--out-items` in `tools/test.py`. ([#437](https://github.com/open-mmlab/mmclassification/pull/437))
+- Add datetime info and save models using the torch\<1.6 format. ([#439](https://github.com/open-mmlab/mmclassification/pull/439))
+- Improve downstream repositories compatibility. ([#421](https://github.com/open-mmlab/mmclassification/pull/421))
+- Rename the option `--options` to `--cfg-options` in some tools. ([#425](https://github.com/open-mmlab/mmclassification/pull/425))
+- Add PyTorch 1.9 and Python 3.9 build workflow, and remove some CI. ([#422](https://github.com/open-mmlab/mmclassification/pull/422))
+
+### Bug Fixes
+
+- Fix format error in `test.py` when metric returns `np.ndarray`. ([#441](https://github.com/open-mmlab/mmclassification/pull/441))
+- Fix `publish_model` bug if no parent of `out_file`. ([#463](https://github.com/open-mmlab/mmclassification/pull/463))
+- Fix num_classes bug in pytorch2onnx.py. ([#458](https://github.com/open-mmlab/mmclassification/pull/458))
+- Fix missing runtime requirement `packaging`. ([#459](https://github.com/open-mmlab/mmclassification/pull/459))
+- Fix saving simplified model bug in ONNX export tool. ([#438](https://github.com/open-mmlab/mmclassification/pull/438))
+
+### Docs Update
+
+- Update `getting_started.md` and `install.md`. And rewrite `finetune.md`. ([#466](https://github.com/open-mmlab/mmclassification/pull/466))
+- Use PyTorch style docs theme. ([#457](https://github.com/open-mmlab/mmclassification/pull/457))
+- Update metafile and Readme. ([#435](https://github.com/open-mmlab/mmclassification/pull/435))
+- Add `CITATION.cff`. ([#428](https://github.com/open-mmlab/mmclassification/pull/428))
+
+## v0.15.0(31/8/2021)
+
+### Highlights
+
+- Support `hparams` argument in `AutoAugment` and `RandAugment` to provide hyperparameters for sub-policies.
+- Support custom squeeze channels in `SELayer`.
+- Support classwise weight in losses.
+
+### New Features
+
+- Add `hparams` argument in `AutoAugment` and `RandAugment` and some other improvement. ([#398](https://github.com/open-mmlab/mmclassification/pull/398))
+- Support classwise weight in losses. ([#388](https://github.com/open-mmlab/mmclassification/pull/388))
+- Enhance `SELayer` to support custom squeeze channels. ([#417](https://github.com/open-mmlab/mmclassification/pull/417))
+
+### Code Refactor
+
+- Better result visualization. ([#419](https://github.com/open-mmlab/mmclassification/pull/419))
+- Use `post_process` function to handle pred result processing. ([#390](https://github.com/open-mmlab/mmclassification/pull/390))
+- Update `digit_version` function. ([#402](https://github.com/open-mmlab/mmclassification/pull/402))
+- Avoid albumentations installing both opencv and opencv-headless. ([#397](https://github.com/open-mmlab/mmclassification/pull/397))
+- Avoid unnecessary listdir when building ImageNet. ([#396](https://github.com/open-mmlab/mmclassification/pull/396))
+- Use dynamic mmcv download link in TorchServe dockerfile. ([#387](https://github.com/open-mmlab/mmclassification/pull/387))
+
+### Docs Improvement
+
+- Add readme of some algorithms and update meta yml. ([#418](https://github.com/open-mmlab/mmclassification/pull/418))
+- Add Copyright information. ([#413](https://github.com/open-mmlab/mmclassification/pull/413))
+- Fix typo 'metirc'. ([#411](https://github.com/open-mmlab/mmclassification/pull/411))
+- Update QQ group QR code. ([#393](https://github.com/open-mmlab/mmclassification/pull/393))
+- Add PR template and modify issue template. ([#380](https://github.com/open-mmlab/mmclassification/pull/380))
+
+## v0.14.0(4/8/2021)
+
+### Highlights
+
+- Add transformer-in-transformer backbone and pretrain checkpoints, referring to [the paper](https://arxiv.org/abs/2103.00112).
+- Add Chinese colab tutorial.
+- Provide a Dockerfile to build the mmpretrain dev docker image.
+
+### New Features
+
+- Add transformer in transformer backbone and pretrain checkpoints. ([#339](https://github.com/open-mmlab/mmclassification/pull/339))
+- Support mim, welcome to use mim to manage your mmpretrain project. ([#376](https://github.com/open-mmlab/mmclassification/pull/376))
+- Add Dockerfile. ([#365](https://github.com/open-mmlab/mmclassification/pull/365))
+- Add ResNeSt configs. ([#332](https://github.com/open-mmlab/mmclassification/pull/332))
+
+### Improvements
+
+- Use the `persistent_workers` option if available to accelerate training. ([#349](https://github.com/open-mmlab/mmclassification/pull/349))
+- Add Chinese ipynb tutorial. ([#306](https://github.com/open-mmlab/mmclassification/pull/306))
+- Refactor unit tests. ([#321](https://github.com/open-mmlab/mmclassification/pull/321))
+- Support testing mmdet inference with an mmpretrain backbone. ([#343](https://github.com/open-mmlab/mmclassification/pull/343))
+- Use zero as default value of `thrs` in metrics. ([#341](https://github.com/open-mmlab/mmclassification/pull/341))
+
+### Bug Fixes
+
+- Fix ImageNet dataset annotation file parse bug. ([#370](https://github.com/open-mmlab/mmclassification/pull/370))
+- Fix docstring typo and init bug in ShuffleNetV1. ([#374](https://github.com/open-mmlab/mmclassification/pull/374))
+- Use local ATTENTION registry to avoid conflict with other repositories. ([#376](https://github.com/open-mmlab/mmclassification/pull/375))
+- Fix swin transformer config bug. ([#355](https://github.com/open-mmlab/mmclassification/pull/355))
+- Fix `patch_cfg` argument bug in SwinTransformer. ([#368](https://github.com/open-mmlab/mmclassification/pull/368))
+- Fix duplicate `init_weights` call in ViT init function. ([#373](https://github.com/open-mmlab/mmclassification/pull/373))
+- Fix broken `_base_` link in a resnet config. ([#361](https://github.com/open-mmlab/mmclassification/pull/361))
+- Fix vgg-19 model link missing. ([#363](https://github.com/open-mmlab/mmclassification/pull/363))
+
+## v0.13.0(3/7/2021)
+
+- Support Swin-Transformer backbone and add training configs for Swin-Transformer on ImageNet.
+
+### New Features
+
+- Support Swin-Transformer backbone and add training configs for Swin-Transformer on ImageNet. (#271)
+- Add pretrained model of RegNetX. (#269)
+- Support adding custom hooks in config file. (#305)
+- Improve and add Chinese translation of `CONTRIBUTING.md` and all tools tutorials. (#320)
+- Dump config before training. (#282)
+- Add torchscript and torchserve deployment tools. (#279, #284)
+
+### Improvements
+
+- Improve test tools and add some new tools. (#322)
+- Correct MobilenetV3 backbone structure and add pretrained models. (#291)
+- Refactor `PatchEmbed` and `HybridEmbed` as independent components. (#330)
+- Refactor mixup and cutmix as `Augments` to support more functions. (#278)
+- Refactor weights initialization method. (#270, #318, #319)
+- Refactor `LabelSmoothLoss` to support multiple calculation formulas. (#285)
+
+### Bug Fixes
+
+- Fix bug for CPU training. (#286)
+- Fix missing test data when `num_imgs` can not be evenly divided by `num_gpus`. (#299)
+- Fix build compatibility with pytorch v1.3-1.5. (#301)
+- Fix `magnitude_std` bug in `RandAugment`. (#309)
+- Fix bug when `samples_per_gpu` is 1. (#311)
+
+## v0.12.0(3/6/2021)
+
+- Finish adding Chinese tutorials and build Chinese documentation on readthedocs.
+- Update ResNeXt checkpoints and ResNet checkpoints on CIFAR.
+
+### New Features
+
+- Improve and add Chinese translation of `data_pipeline.md` and `new_modules.md`. (#265)
+- Build Chinese translation on readthedocs. (#267)
+- Add an argument efficientnet_style to `RandomResizedCrop` and `CenterCrop`. (#268)
+
+### Improvements
+
+- Only allow directory operation when rank==0 when testing. (#258)
+- Fix typo in `base_head`. (#274)
+- Update ResNeXt checkpoints. (#283)
+
+### Bug Fixes
+
+- Add attribute `data.test` in MNIST configs. (#264)
+- Download CIFAR/MNIST dataset only on rank 0. (#273)
+- Fix MMCV version compatibility. (#276)
+- Fix CIFAR color channels bug and update checkpoints in model zoo. (#280)
+
+## v0.11.1(21/5/2021)
+
+- Refine `new_dataset.md` and add Chinese translation of `finetune.md`, `new_dataset.md`.
+
+### New Features
+
+- Add `dim` argument for `GlobalAveragePooling`. (#236)
+- Add random noise to `RandAugment` magnitude. (#240)
+- Refine `new_dataset.md` and add Chinese translation of `finetune.md`, `new_dataset.md`. (#243)
+
+### Improvements
+
+- Refactor arguments passing for Heads. (#239)
+- Allow more flexible `magnitude_range` in `RandAugment`. (#249)
+- Inherit the MMCV registry so that, in the future, OpenMMLab repos like MMDet and MMSeg can directly use the backbones supported in MMCls. (#252)
+
+### Bug Fixes
+
+- Fix typo in `analyze_results.py`. (#237)
+- Fix typo in unittests. (#238)
+- Check if specified tmpdir exists when testing to avoid deleting existing data. (#242 & #258)
+- Add missing config files in `MANIFEST.in`. (#250 & #255)
+- Use temporary directory under shared directory to collect results to avoid unavailability of temporary directory for multi-node testing. (#251)
+
+## v0.11.0(1/5/2021)
+
+- Support cutmix trick.
+- Support random augmentation.
+- Add `tools/deployment/test.py` as an ONNX runtime test tool.
+- Support ViT backbone and add training configs for ViT on ImageNet.
+- Add Chinese `README.md` and some Chinese tutorials.
+
+### New Features
+
+- Support cutmix trick. (#198)
+- Add `simplify` option in `pytorch2onnx.py`. (#200)
+- Support random augmentation. (#201)
+- Add config and checkpoint for training ResNet on CIFAR-100. (#208)
+- Add `tools/deployment/test.py` as an ONNX runtime test tool. (#212)
+- Support ViT backbone and add training configs for ViT on ImageNet. (#214)
+- Add finetuning configs for ViT on ImageNet. (#217)
+- Add `device` option to support training on CPU. (#219)
+- Add Chinese `README.md` and some Chinese tutorials. (#221)
+- Add `metafile.yml` in configs to support interaction with Papers With Code (PWC) and MMCLI. (#225)
+- Upload configs and converted checkpoints for ViT fine-tuning on ImageNet. (#230)
+
+### Improvements
+
+- Fix `LabelSmoothLoss` so that label smoothing and mixup could be enabled at the same time. (#203)
+- Add `cal_acc` option in `ClsHead`. (#206)
+- Check `CLASSES` in checkpoint to avoid unexpected key error. (#207)
+- Check mmcv version when importing mmpretrain to ensure compatibility. (#209)
+- Update `CONTRIBUTING.md` to align with that in MMCV. (#210)
+- Change tags to html comments in configs README.md. (#226)
+- Clean codes in ViT backbone. (#227)
+- Reformat `pytorch2onnx.md` tutorial. (#229)
+- Update `setup.py` to support MMCLI. (#232)
+
+### Bug Fixes
+
+- Fix missing `cutmix_prob` in ViT configs. (#220)
+- Fix backend for resize in ResNeXt configs. (#222)
+
+## v0.10.0(1/4/2021)
+
+- Support AutoAugmentation
+- Add tutorials for installation and usage.
+
+### New Features
+
+- Add `Rotate` pipeline for data augmentation. (#167)
+- Add `Invert` pipeline for data augmentation. (#168)
+- Add `Color` pipeline for data augmentation. (#171)
+- Add `Solarize` and `Posterize` pipeline for data augmentation. (#172)
+- Support fp16 training. (#178)
+- Add tutorials for installation and basic usage of MMClassification. (#176)
+- Support `AutoAugmentation`, `AutoContrast`, `Equalize`, `Contrast`, `Brightness` and `Sharpness` pipelines for data augmentation. (#179)
+
+### Improvements
+
+- Support dynamic shape export to onnx. (#175)
+- Release training configs and update model zoo for fp16 (#184)
+- Use MMCV's EvalHook in MMClassification (#182)
+
+### Bug Fixes
+
+- Fix wrong naming in vgg config (#181)
+
+## v0.9.0(1/3/2021)
+
+- Implement mixup trick.
+- Add a new tool to create TensorRT engine from ONNX, run inference and verify outputs in Python.
+
+### New Features
+
+- Implement mixup and provide configs of training ResNet50 using mixup. (#160)
+- Add `Shear` pipeline for data augmentation. (#163)
+- Add `Translate` pipeline for data augmentation. (#165)
+- Add `tools/onnx2tensorrt.py` as a tool to create TensorRT engine from ONNX, run inference and verify outputs in Python. (#153)
+
+### Improvements
+
+- Add `--eval-options` in `tools/test.py` to support eval options override, matching the behavior of other open-mmlab projects. (#158)
+- Support showing and saving painted results in `mmpretrain.apis.test` and `tools/test.py`, matching the behavior of other open-mmlab projects. (#162)
+
+### Bug Fixes
+
+- Fix configs for VGG, replace checkpoints converted from other repos with the ones trained by ourselves and upload the missing logs in the model zoo. (#161)
+
+## v0.8.0(31/1/2021)
+
+- Support multi-label task.
+- Support more flexible metrics settings.
+- Fix bugs.
+
+### New Features
+
+- Add evaluation metrics: mAP, CP, CR, CF1, OP, OR, OF1 for multi-label task. (#123)
+- Add BCE loss for multi-label task. (#130)
+- Add focal loss for multi-label task. (#131)
+- Support PASCAL VOC 2007 dataset for multi-label task. (#134)
+- Add asymmetric loss for multi-label task. (#132)
+- Add analyze_results.py to select images for success/fail demonstration. (#142)
+- Support new metric that calculates the total number of occurrences of each label. (#143)
+- Support class-wise evaluation results. (#143)
+- Add thresholds in eval_metrics. (#146)
+- Add heads and a baseline config for multilabel task. (#145)
+
+### Improvements
+
+- Remove the models with 0 checkpoint and ignore the repeated papers when counting papers to gain more accurate model statistics. (#135)
+- Add tags in README.md. (#137)
+- Fix optional issues in docstring. (#138)
+- Update stat.py to classify papers. (#139)
+- Fix mismatched columns in README.md. (#150)
+- Fix test.py to support more evaluation metrics. (#155)
+
+### Bug Fixes
+
+- Fix bug in VGG weight_init. (#140)
+- Fix bug in 2 ResNet configs in which outdated heads were used. (#147)
+- Fix bug of misordered height and width in `RandomCrop` and `RandomResizedCrop`. (#151)
+- Fix missing `meta_keys` in `Collect`. (#149 & #152)
+
+## v0.7.0(31/12/2020)
+
+- Add more evaluation metrics.
+- Fix bugs.
+
+### New Features
+
+- Remove installation of MMCV from requirements. (#90)
+- Add 3 evaluation metrics: precision, recall and F-1 score. (#93)
+- Allow config override during testing and inference with `--options`. (#91 & #96)
+
+### Improvements
+
+- Use `build_runner` to make runners more flexible. (#54)
+- Support to get category ids in `BaseDataset`. (#72)
+- Allow `CLASSES` override during `BaseDataset` initialization. (#85)
+- Allow input image as ndarray during inference. (#87)
+- Optimize MNIST config. (#98)
+- Add config links in model zoo documentation. (#99)
+- Use functions from MMCV to collect environment. (#103)
+- Refactor config files so that they are now categorized by methods. (#116)
+- Add README in config directory. (#117)
+- Add model statistics. (#119)
+- Refactor documentation in consistency with other MM repositories. (#126)
+
+### Bug Fixes
+
+- Add missing `CLASSES` argument to dataset wrappers. (#66)
+- Fix slurm evaluation error during training. (#69)
+- Resolve error caused by shape in `Accuracy`. (#104)
+- Fix bug caused by extremely insufficient data in distributed sampler. (#108)
+- Fix bug in `gpu_ids` in distributed training. (#107)
+- Fix bug caused by extremely insufficient data in collect results during testing. (#114)
+
+## v0.6.0(11/10/2020)
+
+- Support new method: ResNeSt and VGG.
+- Support new dataset: CIFAR10.
+- Provide new tools for model inference and model conversion from PyTorch to ONNX.
+
+### New Features
+
+- Add model inference. (#16)
+- Add pytorch2onnx. (#20)
+- Add PIL backend for transform `Resize`. (#21)
+- Add ResNeSt. (#25)
+- Add VGG and its pretrained models. (#27)
+- Add CIFAR10 configs and models. (#38)
+- Add albumentations transforms. (#45)
+- Visualize results on image demo. (#58)
+
+### Improvements
+
+- Replace urlretrieve with urlopen in dataset.utils. (#13)
+- Resize image according to its short edge. (#22)
+- Update ShuffleNet config. (#31)
+- Update pre-trained models for shufflenet_v2, shufflenet_v1, se-resnet50, se-resnet101. (#33)
+
+### Bug Fixes
+
+- Fix init_weights in `shufflenet_v2.py`. (#29)
+- Fix the parameter `size` in test_pipeline. (#30)
+- Fix the parameter in cosine lr schedule. (#32)
+- Fix the convert tools for mobilenet_v2. (#34)
+- Fix crash in CenterCrop transform when the image is greyscale. (#40)
+- Fix outdated configs. (#53)
diff --git a/docs/en/notes/contribution_guide.md b/docs/en/notes/contribution_guide.md
new file mode 120000
index 0000000000000000000000000000000000000000..c97564d93a7f0a753a23cd97d2467d595bd154ff
--- /dev/null
+++ b/docs/en/notes/contribution_guide.md
@@ -0,0 +1 @@
+../../../CONTRIBUTING.md
\ No newline at end of file
diff --git a/docs/en/notes/faq.md b/docs/en/notes/faq.md
new file mode 100644
index 0000000000000000000000000000000000000000..da45841bb10c347bb3724d5e49e90ab5199c5caf
--- /dev/null
+++ b/docs/en/notes/faq.md
@@ -0,0 +1,116 @@
+# Frequently Asked Questions
+
+We list some common troubles faced by many users and their corresponding
+solutions here. Feel free to enrich the list if you find any frequent issues
+and have ways to help others to solve them. If the contents here do not cover
+your issue, please create an issue using the
+[provided templates](https://github.com/open-mmlab/mmpretrain/issues/new/choose)
+and make sure you fill in all required information in the template.
+
+## Installation
+
+- Compatibility issue between MMEngine, MMCV and MMPretrain
+
+ The compatible versions of MMPretrain, MMEngine and MMCV are shown below. Please
+ choose the correct versions of MMEngine and MMCV to avoid installation issues; a quick setup example follows this list.
+
+ | MMPretrain version | MMEngine version | MMCV version |
+ | :----------------: | :---------------: | :--------------: |
+ | 1.2.0 (main) | mmengine >= 0.8.3 | mmcv >= 2.0.0 |
+ | 1.1.1 | mmengine >= 0.8.3 | mmcv >= 2.0.0 |
+ | 1.0.0 | mmengine >= 0.8.0 | mmcv >= 2.0.0 |
+ | 1.0.0rc8 | mmengine >= 0.7.1 | mmcv >= 2.0.0rc4 |
+ | 1.0.0rc7 | mmengine >= 0.5.0 | mmcv >= 2.0.0rc4 |
+
+ ```{note}
+ Since the `dev` branch is under frequent development, the MMEngine and MMCV
+ version dependency may be inaccurate. If you encounter problems when using
+ the `dev` branch, please try to update MMEngine and MMCV to the latest version.
+ ```
+
+- Using Albumentations
+
+ If you would like to use `albumentations`, we suggest using `pip install -r requirements/albu.txt` or
+ `pip install -U albumentations --no-binary qudida,albumentations`.
+
+ If you simply use `pip install albumentations>=0.3.2`, it will install `opencv-python-headless` simultaneously
+ (even though you have already installed `opencv-python`). Please refer to the
+ [official documentation](https://albumentations.ai/docs/getting_started/installation/#note-on-opencv-dependencies)
+ for details.
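+
+As a quick sketch (the version pins follow the table above and are only an example), you can install matching versions of MMEngine and MMCV with [MIM](https://github.com/open-mmlab/mim):
+
+```shell
+pip install -U openmim
+mim install "mmengine>=0.8.3" "mmcv>=2.0.0"
+```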
+
+## General Questions
+
+### Do I need to reinstall mmpretrain after some code modifications?
+
+If you follow [the best practice](../get_started.md#best-practices) and install mmpretrain from source,
+any local modifications made to the code will take effect without
+reinstallation.
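+
+For reference, a minimal sketch of an editable install from source (following the best practice linked above), which is why local edits take effect immediately:
+
+```shell
+# clone the repository and install it in editable (development) mode
+git clone https://github.com/open-mmlab/mmpretrain.git
+cd mmpretrain
+pip install -U openmim && mim install -e .
+```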
+
+### How to develop with multiple MMPretrain versions?
+
+Generally speaking, we recommend using different virtual environments to
+manage MMPretrain in different working directories. However, you
+can also use the same environment to develop MMPretrain in different
+folders, like mmpretrain-0.21 and mmpretrain-0.23. When you run the train or test shell scripts,
+they adopt the mmpretrain package in the current folder. And when you run other Python
+scripts, you can also add `` PYTHONPATH=`pwd` `` at the beginning of your command
+to use the package in the current folder.
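+
+For example, a quick sketch to check which copy of mmpretrain is picked up (the folder name is only illustrative):
+
+```shell
+cd ~/mmpretrain-0.23  # the working copy you want to develop
+PYTHONPATH=`pwd` python -c "import mmpretrain; print(mmpretrain.__file__)"
+```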
+
+Conversely, to use the default MMPretrain installed in the environment
+rather than the one you are working with, you can remove the following line
+in those shell scripts:
+
+```shell
+PYTHONPATH="$(dirname $0)/..":$PYTHONPATH
+```
+
+### What's the relationship between the `load_from` and the `init_cfg`?
+
+- `load_from`: If `resume=False`, only the model weights are loaded, which is mainly used to load trained models;
+ if `resume=True`, all of the model weights, optimizer state, and other training information are loaded, which is
+ mainly used to resume interrupted training.
+
+- `init_cfg`: You can also specify `init_cfg=dict(type="Pretrained", checkpoint=xxx)` to load a checkpoint. It
+ means the weights are loaded during model weight initialization, that is, only at the
+ beginning of the training. It's mainly used to fine-tune a pre-trained model, and you can set it in
+ the backbone config and use the `prefix` field to only load backbone weights, for example:
+
+```python
+model = dict(
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ init_cfg=dict(type='Pretrained', checkpoint=xxx, prefix='backbone'),
+ )
+ ...
+)
+```
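+
+For comparison, a minimal sketch of using `load_from` to resume an interrupted training run (the checkpoint path is just a placeholder):
+
+```python
+# resume training from a previously saved checkpoint
+load_from = 'work_dirs/my_exp/epoch_20.pth'  # placeholder checkpoint path
+resume = True  # also restore the optimizer state and other training information
+```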
+
+See the [Fine-tune Models](./finetune_custom_dataset.md) for more details about fine-tuning.
+
+### What's the difference between `default_hooks` and `custom_hooks`?
+
+Almost none. Usually, the `default_hooks` field is used to specify the hooks that are used in almost
+all experiments, while the `custom_hooks` field is used only in some experiments.
+
+Another difference is that `default_hooks` is a dict while `custom_hooks` is a list; please don't
+confuse them.
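+
+For instance, a minimal sketch (the hook types and values here are only illustrative) showing the different shapes of the two fields:
+
+```python
+# `default_hooks` is a dict: each key overrides one of the pre-defined hooks.
+default_hooks = dict(
+    logger=dict(type='LoggerHook', interval=100),
+    checkpoint=dict(type='CheckpointHook', interval=1),
+)
+
+# `custom_hooks` is a list of extra hooks appended to the runner.
+custom_hooks = [
+    dict(type='EMAHook', momentum=1e-4),
+]
+```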
+
+### During training, I got no training log, what's the reason?
+
+If your training dataset is small while the batch size is large, our default log interval may be too large to
+record your training log.
+
+You can shrink the log interval and try again, like:
+
+```python
+default_hooks = dict(
+ ...
+ logger=dict(type='LoggerHook', interval=10),
+ ...
+)
+```
+
+### How to train with other datasets, like my own dataset or COCO?
+
+We provide [specific examples](./pretrain_custom_dataset.md) to show how to train with other datasets.
diff --git a/docs/en/notes/finetune_custom_dataset.md b/docs/en/notes/finetune_custom_dataset.md
new file mode 100644
index 0000000000000000000000000000000000000000..4000268ca47651233799dbcba3add351979e65c0
--- /dev/null
+++ b/docs/en/notes/finetune_custom_dataset.md
@@ -0,0 +1,340 @@
+# How to Fine-tune with Custom Dataset
+
+In most scenarios, we want to apply a pre-trained model instead of training from scratch, since training from scratch may introduce extra uncertainty about model convergence and is time-consuming.
+The common practice is to start from a model trained on a large dataset, which hopefully provides better knowledge than a random initialization. Roughly speaking, this process is known as fine-tuning.
+
+Models pre-trained on the ImageNet dataset have been demonstrated to be effective for other datasets and other downstream tasks.
+Hence, this tutorial provides instructions for users to use the models provided in the [Model Zoo](../modelzoo_statistics.md) for other datasets to obtain better performance.
+
+In this tutorial, we provide a practice example and some tips on how to fine-tune a model on your own dataset.
+
+## Step-1: Prepare your dataset
+
+Prepare your dataset following [Prepare Dataset](../user_guides/dataset_prepare.md).
+And the root folder of the dataset can be like `data/custom_dataset/`.
+
+Here, we assume you want to do supervised image-classification training, and use the sub-folder format
+`CustomDataset` to organize your dataset as:
+
+```text
+data/custom_dataset/
+├── train
+│ ├── class_x
+│ │ ├── x_1.png
+│ │ ├── x_2.png
+│ │ ├── x_3.png
+│ │ └── ...
+│ ├── class_y
+│ └── ...
+└── test
+ ├── class_x
+ │ ├── test_x_1.png
+ │ ├── test_x_2.png
+ │ ├── test_x_3.png
+ │ └── ...
+ ├── class_y
+ └── ...
+```
+
+## Step-2: Choose one config as template
+
+Here, we would like to use `configs/resnet/resnet50_8xb32_in1k.py` as the example. We first copy this config
+file to the same folder and rename it as `resnet50_8xb32-ft_custom.py`.
+
+```{tip}
+As a convention, the last field of the config name is the dataset, e.g., `in1k` for the ImageNet dataset and `coco` for the COCO dataset.
+```
+
+The content of this config is:
+
+```python
+_base_ = [
+ '../_base_/models/resnet50.py', # model settings
+ '../_base_/datasets/imagenet_bs32.py', # data settings
+ '../_base_/schedules/imagenet_bs256.py', # schedule settings
+ '../_base_/default_runtime.py', # runtime settings
+]
+```
+
+## Step-3: Edit the model settings
+
+When fine-tuning a model, usually we want to load the pre-trained backbone
+weights and train a new classification head from scratch.
+
+To load the pre-trained backbone, we need to change the initialization config
+of the backbone and use the `Pretrained` initialization function. Besides, in the
+`init_cfg`, we use `prefix='backbone'` to tell the initialization function
+the prefix of the submodule to be loaded from the checkpoint.
+
+For example, `backbone` here means to load the backbone submodule. Here we
+use an online checkpoint, which will be downloaded automatically during training;
+you can also download the model manually and use a local path.
+Then we need to modify the head according to the number of classes of the new
+dataset by simply changing `num_classes` in the head.
+
+When the new dataset is small and shares the same domain as the pre-training dataset,
+we may want to freeze the parameters of the first several stages of the
+backbone, which helps the network keep its ability to extract the low-level
+features learnt from the pre-trained model. In MMPretrain, you can simply
+specify how many stages to freeze with the `frozen_stages` argument. For example, to
+freeze the parameters of the first two stages, just use the following config:
+
+```{note}
+Not all backbones support the `frozen_stages` argument by now. Please check
+[the docs](https://mmpretrain.readthedocs.io/en/latest/api.html#module-mmpretrain.models.backbones)
+to confirm if your backbone supports it.
+```
+
+```python
+_base_ = [
+ '../_base_/models/resnet50.py', # model settings
+ '../_base_/datasets/imagenet_bs32.py', # data settings
+ '../_base_/schedules/imagenet_bs256.py', # schedule settings
+ '../_base_/default_runtime.py', # runtime settings
+]
+
+# >>>>>>>>>>>>>>> Override model settings here >>>>>>>>>>>>>>>>>>>
+model = dict(
+ backbone=dict(
+ frozen_stages=2,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
+ prefix='backbone',
+ )),
+ head=dict(num_classes=10),
+)
+# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
+```
+
+```{tip}
+Here we only need to set the part of the configs we want to modify, because the
+inherited configs will be merged to get the entire config.
+```
+
+## Step-4: Edit the dataset settings
+
+To fine-tune on a new dataset, we need to override some dataset settings, like the type of dataset, data
+pipeline, etc.
+
+```python
+_base_ = [
+ '../_base_/models/resnet50.py', # model settings
+ '../_base_/datasets/imagenet_bs32.py', # data settings
+ '../_base_/schedules/imagenet_bs256.py', # schedule settings
+ '../_base_/default_runtime.py', # runtime settings
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ frozen_stages=2,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
+ prefix='backbone',
+ )),
+ head=dict(num_classes=10),
+)
+
+# >>>>>>>>>>>>>>> Override data settings here >>>>>>>>>>>>>>>>>>>
+data_root = 'data/custom_dataset'
+train_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='train',
+ ))
+val_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='test',
+ ))
+test_dataloader = val_dataloader
+# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
+```
+
+## Step-5: Edit the schedule settings (optional)
+
+The fine-tuning hyper-parameters differ from the default schedule. Fine-tuning usually
+requires a smaller learning rate and a learning rate schedule that decays over fewer epochs.
+
+```python
+_base_ = [
+ '../_base_/models/resnet50.py', # model settings
+ '../_base_/datasets/imagenet_bs32.py', # data settings
+ '../_base_/schedules/imagenet_bs256.py', # schedule settings
+ '../_base_/default_runtime.py', # runtime settings
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ frozen_stages=2,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
+ prefix='backbone',
+ )),
+ head=dict(num_classes=10),
+)
+
+# data settings
+data_root = 'data/custom_dataset'
+train_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='train',
+ ))
+val_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='test',
+ ))
+test_dataloader = val_dataloader
+
+# >>>>>>>>>>>>>>> Override schedule settings here >>>>>>>>>>>>>>>>>>>
+# optimizer hyper-parameters
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001))
+# learning policy
+param_scheduler = dict(
+ type='MultiStepLR', by_epoch=True, milestones=[15], gamma=0.1)
+# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
+```
+
+```{tip}
+Refer to [Learn about Configs](../user_guides/config.md) for more detailed configurations.
+```
+
+## Start Training
+
+Now, we have finished the fine-tuning config file as follows:
+
+```python
+_base_ = [
+ '../_base_/models/resnet50.py', # model settings
+ '../_base_/datasets/imagenet_bs32.py', # data settings
+ '../_base_/schedules/imagenet_bs256.py', # schedule settings
+ '../_base_/default_runtime.py', # runtime settings
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ frozen_stages=2,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
+ prefix='backbone',
+ )),
+ head=dict(num_classes=10),
+)
+
+# data settings
+data_root = 'data/custom_dataset'
+train_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='train',
+ ))
+val_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='test',
+ ))
+test_dataloader = val_dataloader
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001))
+param_scheduler = dict(
+ type='MultiStepLR', by_epoch=True, milestones=[15], gamma=0.1)
+```
+
+To train the model with 8 GPUs, use the following command:
+
+```shell
+bash tools/dist_train.sh configs/resnet/resnet50_8xb32-ft_custom.py 8
+```
+
+You can also train the model with only one GPU using the following command:
+
+```shell
+python tools/train.py configs/resnet/resnet50_8xb32-ft_custom.py
+```
+
+However, an important config needs to be changed when using a single GPU. We need to
+change the dataset config as follows:
+
+```python
+data_root = 'data/custom_dataset'
+train_dataloader = dict(
+ batch_size=256,
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='train',
+ ))
+val_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='test',
+ ))
+test_dataloader = val_dataloader
+```
+
+This is because our training schedule assumes a total batch size of 256. When using 8 GPUs,
+the `batch_size=32` setting in the base config file applies to every GPU, so the total batch
+size is 256. When using one GPU, you need to change the batch size to 256 manually to
+match the training schedule.
+
+However, a larger batch size requires more GPU memory. Here are several simple tricks to save GPU
+memory:
+
+1. Enable Automatic-Mixed-Precision training.
+
+ ```shell
+ python tools/train.py configs/resnet/resnet50_8xb32-ft_custom.py --amp
+ ```
+
+2. Use a smaller batch size, like `batch_size=32` instead of 256, and enable the auto learning rate scaling.
+
+ ```shell
+ python tools/train.py configs/resnet/resnet50_8xb32-ft_custom.py --auto-scale-lr
+ ```
+
+ The auto learning rate scaling will adjust the learning rate according to the actual batch size and the
+ `auto_scale_lr.base_batch_size` (you can find it in the base config
+ `configs/_base_/schedules/imagenet_bs256.py`).
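+
+For reference, the relevant line in that base schedule config is expected to look like the minimal sketch below (the value is assumed from the config name; check your local copy of the file):
+
+```python
+# excerpt (assumed) from configs/_base_/schedules/imagenet_bs256.py
+auto_scale_lr = dict(base_batch_size=256)
+```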
+
+```{note}
+Most of these tricks may influence the training performance slightly.
+```
+
+### Apply pre-trained model with command line
+
+If you don't want to modify the configs, you could use `--cfg-options` to add your pre-trained model path to `init_cfg`.
+
+For example, the command below will also load the pre-trained model.
+
+```shell
+bash tools/dist_train.sh configs/resnet/resnet50_8xb32-ft_custom.py 8 \
+ --cfg-options model.backbone.init_cfg.type='Pretrained' \
+ model.backbone.init_cfg.checkpoint='https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth' \
+ model.backbone.init_cfg.prefix='backbone'
+```
diff --git a/docs/en/notes/pretrain_custom_dataset.md b/docs/en/notes/pretrain_custom_dataset.md
new file mode 100644
index 0000000000000000000000000000000000000000..c9e583799c2922579b3611892e4dae56ca2a285d
--- /dev/null
+++ b/docs/en/notes/pretrain_custom_dataset.md
@@ -0,0 +1,255 @@
+# How to Pretrain with Custom Dataset
+
+In this tutorial, we provide a practice example and some tips on how to train on your own dataset.
+
+In MMPretrain, we support the `CustomDataset` (similar to `ImageFolder` in `torchvision`), which is able to read the images within the specified folder directly. You only need to prepare the path information of the custom dataset and edit the config.
+
+## Step-1: Prepare your dataset
+
+Prepare your dataset following [Prepare Dataset](../user_guides/dataset_prepare.md).
+And the root folder of the dataset can be like `data/custom_dataset/`.
+
+Here, we assume you want to do unsupervised training, and use the sub-folder format `CustomDataset` to
+organize your dataset as:
+
+```text
+data/custom_dataset/
+├── sample1.png
+├── sample2.png
+├── sample3.png
+├── sample4.png
+└── ...
+```
+
+## Step-2: Choose one config as template
+
+Here, we would like to use `configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py` as the example. We
+first copy this config file to the same folder and rename it as
+`mae_vit-base-p16_8xb512-amp-coslr-300e_custom.py`.
+
+```{tip}
+As a convention, the last field of the config name is the dataset, e.g., `in1k` for the ImageNet dataset and `coco` for the COCO dataset.
+```
+
+The content of this config is:
+
+```python
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
+```
+
+## Step-3: Edit the dataset related config
+
+- Override the `type` of dataset settings as `'CustomDataset'`
+- Override the `data_root` of dataset settings as `data/custom_dataset`.
+- Override the `ann_file` of dataset settings as an empty string since we assume you are using the sub-folder
+ format `CustomDataset`.
+- Override the `data_prefix` of dataset settings as an empty string since we are using the whole dataset under
+ the `data_root`, and you don't need to split samples into different subsets or set the `data_prefix`.
+
+The modified config will be like:
+
+```python
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# >>>>>>>>>>>>>>> Override dataset settings here >>>>>>>>>>>>>>>>>>>
+train_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root='data/custom_dataset/',
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='', # The `data_root` is the data_prefix directly.
+ with_label=False,
+ )
+)
+# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
+```
+
+By using the edited config file, you are able to train a self-supervised model with the MAE algorithm on the custom dataset.
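+
+For example, a typical launch command looks like the sketch below (the GPU count is illustrative; adjust it to your machine, and note that `auto_scale_lr` above assumes a total batch size of 4096):
+
+```shell
+bash tools/dist_train.sh configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_custom.py 8
+```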
+
+## Another example: Train MAE on COCO Dataset
+
+```{note}
+You need to install MMDetection to use the `mmdet.CocoDataset`. Please follow this [documentation](https://github.com/open-mmlab/mmdetection/blob/3.x/docs/en/get_started.md) to install it.
+```
+
+Following the aforementioned idea, we also present an example of how to train MAE on the COCO dataset. The edited file will be like this:
+
+```python
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# >>>>>>>>>>>>>>> Override dataset settings here >>>>>>>>>>>>>>>>>>>
+train_dataloader = dict(
+ dataset=dict(
+ type='mmdet.CocoDataset',
+ data_root='data/coco/',
+ ann_file='annotations/instances_train2017.json', # Only for loading images, and the labels won't be used.
+ data_prefix=dict(img='train2017/'),
+ )
+)
+# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
+```
diff --git a/docs/en/notes/projects.md b/docs/en/notes/projects.md
new file mode 100644
index 0000000000000000000000000000000000000000..d6b625432948307e76e5ddeb5b994e575874e425
--- /dev/null
+++ b/docs/en/notes/projects.md
@@ -0,0 +1,21 @@
+# Projects based on MMPretrain
+
+There are many projects built upon MMPretrain (previously MMClassification).
+We list some of them as examples of how to extend MMPretrain for your own projects.
+As this page might not be complete, please feel free to create a PR to update it.
+
+## Projects as an extension
+
+- [OpenMixup](https://github.com/Westlake-AI/openmixup): an open-source toolbox for supervised, self-, and semi-supervised visual representation learning with mixup based on PyTorch, especially for mixup-related methods.
+- [AI Power](https://github.com/ykk648/AI_power): AI toolbox and pretrain models.
+- [OpenBioSeq](https://github.com/Westlake-AI/OpenBioSeq): an open-source supervised and self-supervised bio-sequence representation learning toolbox based on PyTorch.
+
+## Projects of papers
+
+There are also projects released with papers.
+Some of the papers are published in top-tier conferences (CVPR, ICCV, and ECCV), and the others are also highly influential.
+To make this list also a reference for the community to develop and compare new image classification algorithms, we list them following the time order of top-tier conferences.
+Methods already supported and maintained by MMPretrain (previously MMClassification) are not listed.
+
+- Involution: Inverting the Inherence of Convolution for Visual Recognition, CVPR21. [[paper]](https://arxiv.org/abs/2103.06255)[[github]](https://github.com/d-li14/involution)
+- Convolution of Convolution: Let Kernels Spatially Collaborate, CVPR22. [[paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Zhao_Convolution_of_Convolution_Let_Kernels_Spatially_Collaborate_CVPR_2022_paper.pdf)[[github]](https://github.com/Genera1Z/ConvolutionOfConvolution)
diff --git a/docs/en/stat.py b/docs/en/stat.py
new file mode 100755
index 0000000000000000000000000000000000000000..2d74823b10020af523bba787bbca7521ff797f17
--- /dev/null
+++ b/docs/en/stat.py
@@ -0,0 +1,249 @@
+#!/usr/bin/env python
+import re
+import warnings
+from collections import defaultdict
+from pathlib import Path
+
+from modelindex.load_model_index import load
+from modelindex.models.Result import Result
+from tabulate import tabulate
+
+MMPT_ROOT = Path(__file__).absolute().parents[2]
+PAPERS_ROOT = Path('papers') # Path to save generated paper pages.
+GITHUB_PREFIX = 'https://github.com/open-mmlab/mmpretrain/blob/main/'
+MODELZOO_TEMPLATE = """\
+# Model Zoo Summary
+
+In this page, we list [all algorithms](#all-supported-algorithms) we support. You can click the link to jump to the corresponding model pages.
+
+And we also list all checkpoints for different tasks we provide. You can sort or search checkpoints in the table and click the corresponding link to model pages for more details.
+
+## All supported algorithms
+
+* Number of papers: {num_papers}
+{type_msg}
+
+* Number of checkpoints: {num_ckpts}
+{paper_msg}
+
+""" # noqa: E501
+
+METRIC_ALIAS = {
+ 'Top 1 Accuracy': 'Top-1 (%)',
+ 'Top 5 Accuracy': 'Top-5 (%)',
+}
+
+model_index = load(str(MMPT_ROOT / 'model-index.yml'))
+
+
+def build_collections(model_index):
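+    """Attach each model to its collection and record the tasks it supports."""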
+ col_by_name = {}
+ for col in model_index.collections:
+ setattr(col, 'models', [])
+ col_by_name[col.name] = col
+
+ for model in model_index.models:
+ col = col_by_name[model.in_collection]
+ col.models.append(model)
+ setattr(model, 'collection', col)
+ if model.results is None:
+ setattr(model, 'tasks', [])
+ else:
+ setattr(model, 'tasks', [result.task for result in model.results])
+
+
+build_collections(model_index)
+
+
+def count_papers(collections):
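+    """Count papers and checkpoints, then write the summary to modelzoo_statistics.md."""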
+ total_num_ckpts = 0
+ type_count = defaultdict(int)
+ paper_msgs = []
+
+ for collection in collections:
+ with open(MMPT_ROOT / collection.readme) as f:
+ readme = f.read()
+ ckpts = set(x.lower().strip()
+ for x in re.findall(r'\[model\]\((https?.*)\)', readme))
+ total_num_ckpts += len(ckpts)
+ title = collection.paper['Title']
+ papertype = collection.data.get('type', 'Algorithm')
+ type_count[papertype] += 1
+
+ readme = PAPERS_ROOT / Path(
+ collection.filepath).parent.with_suffix('.md').name
+ paper_msgs.append(
+ f'\t- [{papertype}] [{title}]({readme}) ({len(ckpts)} ckpts)')
+
+ type_msg = '\n'.join(
+ [f'\t- {type_}: {count}' for type_, count in type_count.items()])
+ paper_msg = '\n'.join(paper_msgs)
+
+ modelzoo = MODELZOO_TEMPLATE.format(
+ num_papers=len(collections),
+ num_ckpts=total_num_ckpts,
+ type_msg=type_msg,
+ paper_msg=paper_msg,
+ )
+
+ with open('modelzoo_statistics.md', 'w') as f:
+ f.write(modelzoo)
+
+
+count_papers(model_index.collections)
+
+
+def generate_paper_page(collection):
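+    """Copy a collection's README into PAPERS_ROOT, rewriting relative links to GitHub links."""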
+ PAPERS_ROOT.mkdir(exist_ok=True)
+
+ # Write a copy of README
+ with open(MMPT_ROOT / collection.readme) as f:
+ readme = f.read()
+ folder = Path(collection.filepath).parent
+ copy = PAPERS_ROOT / folder.with_suffix('.md').name
+
+ def replace_link(matchobj):
+ # Replace relative link to GitHub link.
+ name = matchobj.group(1)
+ link = matchobj.group(2)
+ if not link.startswith('http'):
+ assert (folder / link).exists(), \
+ f'Link not found:\n{collection.readme}: {link}'
+ rel_link = (folder / link).absolute().relative_to(MMPT_ROOT)
+ link = GITHUB_PREFIX + str(rel_link)
+ return f'[{name}]({link})'
+
+ content = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', replace_link, readme)
+ content = f'---\ngithub_page: /{collection.readme}\n---\n' + content
+
+ def make_tabs(matchobj):
+ """modify the format from emphasis black symbol to tabs."""
+ content = matchobj.group()
+ content = content.replace('', '')
+ content = content.replace('', '')
+
+        # split the content by "**{Tab-Name}**"
+ splits = re.split(r'^\*\*(.*)\*\*$', content, flags=re.M)[1:]
+ tabs_list = []
+ for title, tab_content in zip(splits[::2], splits[1::2]):
+ title = ':::{tab} ' + title + '\n'
+ tab_content = tab_content.strip() + '\n:::\n'
+ tabs_list.append(title + tab_content)
+
+ return '::::{tabs}\n' + ''.join(tabs_list) + '::::'
+
+    if '<!-- [TABS-BEGIN] -->' in content and '<!-- [TABS-END] -->' in content:
+        # Make the TABS block a selective tabs directive
+        try:
+            pattern = r'<!-- \[TABS-BEGIN\] -->([\d\D]*?)<!-- \[TABS-END\] -->'
+ content = re.sub(pattern, make_tabs, content)
+ except Exception as e:
+ warnings.warn(f'Can not parse the TABS, get an error : {e}')
+
+ with open(copy, 'w') as copy_file:
+ copy_file.write(content)
+
+
+for collection in model_index.collections:
+ generate_paper_page(collection)
+
+
+def scatter_results(models):
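+    """Flatten models into (model, result) pairs; models without results get an empty Result."""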
+ model_result_pairs = []
+ for model in models:
+ if model.results is None:
+ result = Result(task=None, dataset=None, metrics={})
+ model_result_pairs.append((model, result))
+ else:
+ for result in model.results:
+ model_result_pairs.append((model, result))
+ return model_result_pairs
+
+
+def generate_summary_table(task, model_result_pairs, title=None):
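+    """Append a metric summary table for the given task to modelzoo_statistics.md."""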
+ metrics = set()
+ for model, result in model_result_pairs:
+ if result.task == task:
+ metrics = metrics.union(result.metrics.keys())
+ metrics = sorted(list(metrics))
+
+ rows = []
+ for model, result in model_result_pairs:
+ if result.task != task:
+ continue
+ name = model.name
+ params = f'{model.metadata.parameters / 1e6:.2f}' # Params
+ if model.metadata.flops is not None:
+ flops = f'{model.metadata.flops / 1e9:.2f}' # Flops
+ else:
+ flops = None
+ readme = Path(model.collection.filepath).parent.with_suffix('.md').name
+ page = f'[link]({PAPERS_ROOT / readme})'
+ model_metrics = []
+ for metric in metrics:
+ model_metrics.append(str(result.metrics.get(metric, '')))
+
+ rows.append([name, params, flops, *model_metrics, page])
+
+ with open('modelzoo_statistics.md', 'a') as f:
+ if title is not None:
+ f.write(f'\n{title}')
+ f.write("""\n```{table}\n:class: model-summary\n""")
+ header = [
+ 'Model',
+ 'Params (M)',
+ 'Flops (G)',
+ *[METRIC_ALIAS.get(metric, metric) for metric in metrics],
+ 'Readme',
+ ]
+ table_cfg = dict(
+ tablefmt='pipe',
+ floatfmt='.2f',
+ numalign='right',
+ stralign='center')
+ f.write(tabulate(rows, header, **table_cfg))
+ f.write('\n```\n')
+
+
+def generate_dataset_wise_table(task, model_result_pairs, title=None):
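+    """Group the results of a task by dataset and generate one summary table per dataset."""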
+ dataset_rows = defaultdict(list)
+ for model, result in model_result_pairs:
+ if result.task == task:
+ dataset_rows[result.dataset].append((model, result))
+
+ if title is not None:
+ with open('modelzoo_statistics.md', 'a') as f:
+ f.write(f'\n{title}')
+ for dataset, pairs in dataset_rows.items():
+ generate_summary_table(task, pairs, title=f'### {dataset}')
+
+
+model_result_pairs = scatter_results(model_index.models)
+
+# Generate Pretrain Summary
+generate_summary_table(
+ task=None,
+ model_result_pairs=model_result_pairs,
+ title='## Pretrained Models',
+)
+
+# Generate Image Classification Summary
+generate_dataset_wise_table(
+ task='Image Classification',
+ model_result_pairs=model_result_pairs,
+ title='## Image Classification',
+)
+
+# Generate Multi-Label Classification Summary
+generate_dataset_wise_table(
+ task='Multi-Label Classification',
+ model_result_pairs=model_result_pairs,
+ title='## Multi-Label Classification',
+)
+
+# Generate Image Retrieval Summary
+generate_dataset_wise_table(
+ task='Image Retrieval',
+ model_result_pairs=model_result_pairs,
+ title='## Image Retrieval',
+)
diff --git a/docs/en/useful_tools/cam_visualization.md b/docs/en/useful_tools/cam_visualization.md
new file mode 100644
index 0000000000000000000000000000000000000000..023e37ac2397fd315df8eace8ed1fde1c9f1abb1
--- /dev/null
+++ b/docs/en/useful_tools/cam_visualization.md
@@ -0,0 +1,164 @@
+# Class Activation Map (CAM) Visualization
+
+## Introduction of the CAM visualization tool
+
+MMPretrain provides the `tools/visualization/vis_cam.py` tool to visualize class activation maps. Please use the `pip install "grad-cam>=1.3.6"` command to install [pytorch-grad-cam](https://github.com/jacobgil/pytorch-grad-cam).
+
+The supported methods are as follows:
+
+| Method | What it does |
+| ------------ | ---------------------------------------------------------------------------------------------------------------------------- |
+| GradCAM | Weight the 2D activations by the average gradient |
+| GradCAM++ | Like GradCAM but uses second order gradients |
+| XGradCAM | Like GradCAM but scale the gradients by the normalized activations |
+| EigenCAM     | Takes the first principal component of the 2D Activations (no class discrimination, but seems to give great results)          |
+| EigenGradCAM | Like EigenCAM but with class discrimination: First principal component of Activations\*Grad. Looks like GradCAM, but cleaner  |
+| LayerCAM | Spatially weight the activations by positive gradients. Works better especially in lower layers |
+
+Newer CAM methods supported by more recent versions of `pytorch-grad-cam` may also work, but we haven't verified their availability.
+
+**Command**:
+
+```bash
+python tools/visualization/vis_cam.py \
+ ${IMG} \
+ ${CONFIG_FILE} \
+ ${CHECKPOINT} \
+ [--target-layers ${TARGET-LAYERS}] \
+ [--preview-model] \
+ [--method ${METHOD}] \
+ [--target-category ${TARGET-CATEGORY}] \
+ [--save-path ${SAVE_PATH}] \
+ [--vit-like] \
+    [--num-extra-tokens ${NUM-EXTRA-TOKENS}] \
+    [--aug-smooth] \
+    [--eigen-smooth] \
+ [--device ${DEVICE}] \
+ [--cfg-options ${CFG-OPTIONS}]
+```
+
+**Description of all arguments**:
+
+- `img`: The target picture path.
+- `config`: The path of the model config file.
+- `checkpoint`: The path of the checkpoint.
+- `--target-layers`: The target layers to compute activation maps from; one or more network layers can be specified. If not set, the norm layer of the last block is used.
+- `--preview-model`: Whether to print all network layer names in the model.
+- `--method`: Visualization method, supports `GradCAM`, `GradCAM++`, `XGradCAM`, `EigenCAM`, `EigenGradCAM`, `LayerCAM`, which are case insensitive. Defaults to `GradCAM`.
+- `--target-category`: The target category. If not set, the category predicted by the given model is used.
+- `--eigen-smooth`: Whether to use the principal component to reduce noise.
+- `--aug-smooth`: Whether to use TTA (Test-Time Augmentation) to get the CAM.
+- `--save-path`: The path to save the CAM visualization image. If not set, the CAM image will not be saved.
+- `--vit-like`: Whether the network is a ViT-like network.
+- `--num-extra-tokens`: The number of extra tokens in ViT-like backbones. If not set, use the `num_extra_tokens` attribute of the backbone.
+- `--device`: The computing device to use. Defaults to 'cpu'.
+- `--cfg-options`: Modifications to the configuration file, refer to [Learn about Configs](../user_guides/config.md).
+
+```{note}
+The argument `--preview-model` can be used to list all network layer names in the given model, which is helpful if you don't know which layers to choose for `--target-layers`. See the example below.
+```
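+
+For example, to print the layer names of the ResNet-50 model used in the examples below before choosing `--target-layers`:
+
+```shell
+python tools/visualization/vis_cam.py \
+    demo/bird.JPEG \
+    configs/resnet/resnet50_8xb32_in1k.py \
+    https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
+    --preview-model
+```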
+
+## How to visualize the CAM of CNN (ResNet-50)
+
+Here are some examples of `target-layers` in ResNet-50, which can be any module or layer:
+
+- `'backbone.layer4'` means the output of the fourth ResLayer.
+- `'backbone.layer4.2'` means the output of the third BottleNeck block in the fourth ResLayer.
+- `'backbone.layer4.2.conv1'` means the output of the `conv1` layer in the above BottleNeck block.
+
+1. Use different methods to visualize the CAM for `ResNet50`. Here the `target-category` is the result predicted by the given checkpoint, using the default `target-layers`.
+
+ ```shell
+ python tools/visualization/vis_cam.py \
+ demo/bird.JPEG \
+ configs/resnet/resnet50_8xb32_in1k.py \
+ https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
+ --method GradCAM
+ # GradCAM++, XGradCAM, EigenCAM, EigenGradCAM, LayerCAM
+ ```
+
+ | Image | GradCAM | GradCAM++ | EigenGradCAM | LayerCAM |
+ | ------------------------------------ | --------------------------------------- | ----------------------------------------- | -------------------------------------------- | ---------------------------------------- |
+ | | | | | |
+
+2. Use different `target-category` values to get the CAM from the same picture. In the `ImageNet` dataset, category 238 is 'Greater Swiss Mountain dog' and category 281 is 'tabby, tabby cat'.
+
+ ```shell
+ python tools/visualization/vis_cam.py \
+ demo/cat-dog.png configs/resnet/resnet50_8xb32_in1k.py \
+ https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
+ --target-layers 'backbone.layer4.2' \
+ --method GradCAM \
+ --target-category 238
+ # --target-category 281
+ ```
+
+ | Category | Image | GradCAM | XGradCAM | LayerCAM |
+ | -------- | ---------------------------------------------- | ------------------------------------------------ | ------------------------------------------------- | ------------------------------------------------- |
+ | Dog | | | | |
+ | Cat | | | | |
+
+3. Use `--eigen-smooth` and `--aug-smooth` to improve visual effects.
+
+ ```shell
+ python tools/visualization/vis_cam.py \
+ demo/dog.jpg \
+ configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py \
+ https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth \
+ --target-layers 'backbone.layer16' \
+ --method LayerCAM \
+ --eigen-smooth --aug-smooth
+ ```
+
+ | Image | LayerCAM | eigen-smooth | aug-smooth | eigen&aug |
+ | ------------------------------------ | --------------------------------------- | ------------------------------------------- | ----------------------------------------- | ----------------------------------------- |
+ | | | | | |
+
+## How to visualize the CAM of vision transformer
+
+Here are some examples:
+
+- `'backbone.norm3'` for Swin-Transformer;
+- `'backbone.layers.11.ln1'` for ViT;
+
+For ViT-like networks, such as ViT, T2T-ViT and Swin-Transformer, the features are flattened. To draw the CAM, we need to specify the `--vit-like` argument to reshape the features into square feature maps.
+
+Besides the flattened features, some ViT-like networks also add extra tokens, like the class token in ViT and T2T-ViT, and the distillation token in DeiT. In these networks, the final classification is done on the tokens computed in the last attention block, so the classification score is not affected by the other features and the gradient of the classification score with respect to them is zero. Therefore, you shouldn't use the output of the last attention block as the target layer in these networks.
+
+To exclude these extra tokens, we need to know their number. Almost all transformer-based backbones in MMPretrain have the `num_extra_tokens` attribute. If you want to use this tool with a new or third-party network that doesn't have the `num_extra_tokens` attribute, please specify it with the `--num-extra-tokens` argument.
+
+1. Visualize CAM for `Swin Transformer`, using default `target-layers`:
+
+ ```shell
+ python tools/visualization/vis_cam.py \
+ demo/bird.JPEG \
+ configs/swin_transformer/swin-tiny_16xb64_in1k.py \
+ https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth \
+ --vit-like
+ ```
+
+2. Visualize CAM for `Vision Transformer(ViT)`:
+
+ ```shell
+ python tools/visualization/vis_cam.py \
+ demo/bird.JPEG \
+ configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py \
+ https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth \
+ --vit-like \
+ --target-layers 'backbone.layers.11.ln1'
+ ```
+
+3. Visualize CAM for `T2T-ViT`:
+
+ ```shell
+ python tools/visualization/vis_cam.py \
+ demo/bird.JPEG \
+ configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py \
+ https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_3rdparty_8xb64_in1k_20210928-b7c09b62.pth \
+ --vit-like \
+ --target-layers 'backbone.encoder.12.ln1'
+ ```
+
+| Image | ResNet50 | ViT | Swin | T2T-ViT |
+| --------------------------------------- | ------------------------------------------ | -------------------------------------- | --------------------------------------- | ------------------------------------------ |
+| | | | | |
diff --git a/docs/en/useful_tools/complexity_analysis.md b/docs/en/useful_tools/complexity_analysis.md
new file mode 100644
index 0000000000000000000000000000000000000000..ac6d1334c6d18c448d5f89144b421717259d7b19
--- /dev/null
+++ b/docs/en/useful_tools/complexity_analysis.md
@@ -0,0 +1,77 @@
+# Model Complexity Analysis
+
+## Get the FLOPs and params (experimental)
+
+We provide a script adapted from [MMEngine](https://github.com/open-mmlab/mmengine/blob/main/mmengine/analysis/complexity_analysis.py) to compute the FLOPs and params of a given model.
+
+```shell
+python tools/analysis_tools/get_flops.py ${CONFIG_FILE} [--shape ${INPUT_SHAPE}]
+```
+
+Description of all arguments:
+
+- `config`: The path of the model config file.
+- `--shape`: The input size, which supports either a single value or two values, such as `--shape 256` or `--shape 224 256`. If not set, it defaults to `224 224`.
+
+Example:
+
+```shell
+python tools/analysis_tools/get_flops.py configs/resnet/resnet50_8xb32_in1k.py
+```
+
+You will get the final result like this.
+
+```text
+==============================
+Input shape: (3, 224, 224)
+Flops: 4.109G
+Params: 25.557M
+Activation: 11.114M
+==============================
+```
+
+Also, you will get the detailed complexity information of each layer like this:
+
+```text
++--------------------------+----------------------+-----------+--------------+
+| module | #parameters or shape | #flops | #activations |
++--------------------------+----------------------+-----------+--------------+
+| model | 25.557M | 4.109G | 11.114M |
+| backbone | 23.508M | 4.109G | 11.114M |
+| backbone.conv1 | 9.408K | 0.118G | 0.803M |
+| backbone.conv1.weight | (64, 3, 7, 7) | | |
+| backbone.bn1 | 0.128K | 1.606M | 0 |
+| backbone.bn1.weight | (64,) | | |
+| backbone.bn1.bias | (64,) | | |
+| backbone.layer1 | 0.216M | 0.677G | 4.415M |
+| backbone.layer1.0 | 75.008K | 0.235G | 2.007M |
+| backbone.layer1.1 | 70.4K | 0.221G | 1.204M |
+| backbone.layer1.2 | 70.4K | 0.221G | 1.204M |
+| backbone.layer2 | 1.22M | 1.034G | 3.111M |
+| backbone.layer2.0 | 0.379M | 0.375G | 1.305M |
+| backbone.layer2.1 | 0.28M | 0.22G | 0.602M |
+| backbone.layer2.2 | 0.28M | 0.22G | 0.602M |
+| backbone.layer2.3 | 0.28M | 0.22G | 0.602M |
+| backbone.layer3 | 7.098M | 1.469G | 2.158M |
+| backbone.layer3.0 | 1.512M | 0.374G | 0.652M |
+| backbone.layer3.1 | 1.117M | 0.219G | 0.301M |
+| backbone.layer3.2 | 1.117M | 0.219G | 0.301M |
+| backbone.layer3.3 | 1.117M | 0.219G | 0.301M |
+| backbone.layer3.4 | 1.117M | 0.219G | 0.301M |
+| backbone.layer3.5 | 1.117M | 0.219G | 0.301M |
+| backbone.layer4 | 14.965M | 0.81G | 0.627M |
+| backbone.layer4.0 | 6.04M | 0.373G | 0.326M |
+| backbone.layer4.1 | 4.463M | 0.219G | 0.151M |
+| backbone.layer4.2 | 4.463M | 0.219G | 0.151M |
+| head.fc | 2.049M | | |
+| head.fc.weight | (1000, 2048) | | |
+| head.fc.bias | (1000,) | | |
+| neck.gap | | 0.1M | 0 |
++--------------------------+----------------------+-----------+--------------+
+```
+
+```{warning}
+This tool is still experimental and we do not guarantee that the number is correct. You may well use the result for simple comparisons, but double-check it before you adopt it in technical reports or papers.
+- FLOPs are related to the input shape while parameters are not. The default input shape is (1, 3, 224, 224).
+- Some operators are not counted into FLOPs like custom operators. Refer to [`mmengine.analysis.complexity_analysis._DEFAULT_SUPPORTED_FLOP_OPS`](https://github.com/open-mmlab/mmengine/blob/main/mmengine/analysis/complexity_analysis.py) for details.
+```
diff --git a/docs/en/useful_tools/confusion_matrix.md b/docs/en/useful_tools/confusion_matrix.md
new file mode 100644
index 0000000000000000000000000000000000000000..306b585c0d39007adf6db5899105574e7c597f17
--- /dev/null
+++ b/docs/en/useful_tools/confusion_matrix.md
@@ -0,0 +1,84 @@
+# Confusion Matrix
+
+MMPretrain provides the `tools/analysis_tools/confusion_matrix.py` tool to calculate and visualize the confusion matrix. For an introduction to the confusion matrix, see [link](https://en.wikipedia.org/wiki/Confusion_matrix).
+
+## Command-line Usage
+
+**Command**:
+
+```shell
+python tools/analysis_tools/confusion_matrix.py \
+ ${CONFIG_FILE} \
+ ${CHECKPOINT} \
+ [--show] \
+    [--show-path ${SHOW_PATH}] \
+ [--include-values] \
+ [--cmap ${CMAP}] \
+ [--cfg-options ${CFG-OPTIONS}]
+```
+
+**Description of all arguments**:
+
+- `config`: The path of the model config file.
+- `checkpoint`: The path of the checkpoint.
+- `--show`: Whether to show the matplotlib visualization result of the confusion matrix. Defaults to `False`.
+- `--show-path`: The path to save the visualization result when `--show` is set.
+- `--include-values`: Whether to add values to the visualization results.
+- `--cmap`: The color map used for the visualization result. Defaults to `viridis`.
+- `--cfg-options`: Modifications to the configuration file, refer to [Learn about Configs](../user_guides/config.md).
+
+**Examples of use**:
+
+```shell
+python tools/analysis_tools/confusion_matrix.py \
+ configs/resnet/resnet50_8xb16_cifar10.py \
+ https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth \
+ --show
+```
+
+**output image**:
+
+
+
+## **Basic Usage**
+
+```python
+>>> import torch
+>>> from mmpretrain.evaluation import ConfusionMatrix
+>>> y_pred = [0, 1, 1, 3]
+>>> y_true = [0, 2, 1, 3]
+>>> ConfusionMatrix.calculate(y_pred, y_true, num_classes=4)
+tensor([[1, 0, 0, 0],
+ [0, 1, 0, 0],
+ [0, 1, 0, 0],
+ [0, 0, 0, 1]])
+>>> # plot the confusion matrix
+>>> import matplotlib.pyplot as plt
+>>> y_score = torch.rand((1000, 10))
+>>> y_true = torch.randint(10, (1000, ))
+>>> matrix = ConfusionMatrix.calculate(y_score, y_true)
+>>> ConfusionMatrix().plot(matrix)
+>>> plt.show()
+```
+
+## **Use with Evaluator**
+
+```python
+>>> import torch
+>>> from mmpretrain.evaluation import ConfusionMatrix
+>>> from mmpretrain.structures import DataSample
+>>> from mmengine.evaluator import Evaluator
+>>> data_samples = [
+... DataSample().set_gt_label(i%5).set_pred_score(torch.rand(5))
+... for i in range(1000)
+... ]
+>>> evaluator = Evaluator(metrics=ConfusionMatrix())
+>>> evaluator.process(data_samples)
+>>> evaluator.evaluate(1000)
+{'confusion_matrix/result': tensor([[37, 37, 48, 43, 35],
+ [35, 51, 32, 46, 36],
+ [45, 28, 39, 42, 46],
+ [42, 40, 40, 35, 43],
+ [40, 39, 41, 37, 43]])}
+```
diff --git a/docs/en/useful_tools/dataset_visualization.md b/docs/en/useful_tools/dataset_visualization.md
new file mode 100644
index 0000000000000000000000000000000000000000..b1f216ce68a38d9f3d6b59e5c48a61fa0f0375fe
--- /dev/null
+++ b/docs/en/useful_tools/dataset_visualization.md
@@ -0,0 +1,90 @@
+# Dataset Visualization
+
+## Introduce the dataset visualization tool
+
+```bash
+python tools/visualization/browse_dataset.py \
+ ${CONFIG_FILE} \
+ [-o, --output-dir ${OUTPUT_DIR}] \
+ [-p, --phase ${DATASET_PHASE}] \
+ [-n, --show-number ${NUMBER_IMAGES_DISPLAY}] \
+    [-i, --show-interval ${SHOW_INTERVAL}] \
+ [-m, --mode ${DISPLAY_MODE}] \
+ [-r, --rescale-factor ${RESCALE_FACTOR}] \
+ [-c, --channel-order ${CHANNEL_ORDER}] \
+ [--cfg-options ${CFG_OPTIONS}]
+```
+
+**Description of all arguments**:
+
+- `config` : The path of a model config file.
+- `-o, --output-dir`: The output path for visualized images. If not specified, it will be set to `''`, which means not to save.
+- **`-p, --phase`**: Phase of the dataset to visualize, must be one of `['train', 'val', 'test']`. If not specified, it will be set to `'train'`.
+- **`-n, --show-number`**: The number of samples to visualize. If not specified, display all images in the dataset.
+- `-i, --show-interval`: The display interval, in seconds.
+- **`-m, --mode`**: The display mode, can be one of `['original', 'transformed', 'concat', 'pipeline']`. If not specified, it will be set to `'transformed'`.
+- `-r, --rescale-factor`: The image rescale factor, which is useful if the output is too large or too small
+  in the `original` mode.
+- `-c, --channel-order`: The channel order of the displayed images, either "BGR" or "RGB". If not specified, it will be set to 'BGR'.
+- `--cfg-options` : Modifications to the configuration file, refer to [Learn about Configs](../user_guides/config.md).
+
+```{note}
+1. The `-m, --mode` option selects the display mode: show the original pictures, the transformed pictures, or comparison pictures.
+- "original" means to show the images loaded from disk;
+- "transformed" means to show the images after being transformed;
+- "concat" means to show images stitched from the "original" and "transformed" images;
+- "pipeline" means to show all the intermediate images throughout the pipeline.
+
+2. The `-r, --rescale-factor` option is set when the label information is too large or too small relative to the picture. For example, when visualizing the CIFAR dataset, since the resolution of the image is very small, `--rescale-factor` can be set to 10.
+```
+
+## How to visualize the original image
+
+In **'original'** mode:
+
+```shell
+python ./tools/visualization/browse_dataset.py ./configs/resnet/resnet101_8xb16_cifar10.py --phase val --output-dir tmp --mode original --show-number 100 --rescale-factor 10 --channel-order RGB
+```
+
+- `--phase val`: Visualize the validation set, can be simplified to `-p val`;
+- `--output-dir tmp`: Save the visualization results in the "tmp" folder, can be simplified to `-o tmp`;
+- `--mode original`: Visualize the original images, can be simplified to `-m original`;
+- `--show-number 100`: Visualize 100 images, can be simplified to `-n 100`;
+- `--rescale-factor 10`: Enlarge the images by 10 times, can be simplified to `-r 10`;
+- `--channel-order RGB`: Set the channel order of the visualized images to "RGB", can be simplified to `-c RGB`.
+
+
+
+## How to visualize the transformed images
+
+In **'transformed'** mode:
+
+```shell
+python ./tools/visualization/browse_dataset.py ./configs/resnet/resnet50_8xb32_in1k.py -n 100
+```
+
+
+
+## How to visualize the transformed images and original images together
+
+In **'concat'** mode:
+
+```shell
+python ./tools/visualization/browse_dataset.py configs/swin_transformer/swin-small_16xb64_in1k.py -n 10 -m concat
+```
+
+
+
+## How to visualize the intermediate images in the pipeline
+
+In **'pipeline'** mode:
+
+```shell
+python ./tools/visualization/browse_dataset.py configs/swin_transformer/swin-small_16xb64_in1k.py -m pipeline
+```
+
+
+
+```shell
+python ./tools/visualization/browse_dataset.py configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py -m pipeline
+```
+
+
diff --git a/docs/en/useful_tools/log_result_analysis.md b/docs/en/useful_tools/log_result_analysis.md
new file mode 100644
index 0000000000000000000000000000000000000000..99968d7a05937929f021c712808e8fe0ef2db3ff
--- /dev/null
+++ b/docs/en/useful_tools/log_result_analysis.md
@@ -0,0 +1,226 @@
+# Log and Results Analysis
+
+## Log Analysis
+
+### Introduction of log analysis tool
+
+`tools/analysis_tools/analyze_logs.py` plots curves of given keys according to the log files.
+
+
+
+```shell
+python tools/analysis_tools/analyze_logs.py plot_curve \
+ ${JSON_LOGS} \
+ [--keys ${KEYS}] \
+ [--title ${TITLE}] \
+ [--legend ${LEGEND}] \
+ [--backend ${BACKEND}] \
+ [--style ${STYLE}] \
+ [--out ${OUT_FILE}] \
+ [--window-size ${WINDOW_SIZE}]
+```
+
+**Description of all arguments**:
+
+- `json_logs` : The paths of the log files, separate multiple files by spaces.
+- `--keys` : The fields of the logs to analyze, separate multiple keys by spaces. Defaults to 'loss'.
+- `--title` : The title of the figure. Defaults to use the filename.
+- `--legend` : The names of the legend entries, the number of which must be equal to `len(${JSON_LOGS}) * len(${KEYS})`. Defaults to `"${JSON_LOG}-${KEYS}"`.
+- `--backend` : The backend of matplotlib. Defaults to auto selected by matplotlib.
+- `--style` : The style of the figure. Defaults to `whitegrid`.
+- `--out` : The path of the output picture. If not set, the figure won't be saved.
+- `--window-size`: The shape of the display window. The format should be `'W*H'`. Defaults to `'12*7'`.
+
+```{note}
+The `--style` option depends on the `seaborn` package; please install it before setting the style (a sketch install command is shown below).
+```
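+
+For instance, assuming you use pip to manage packages:
+
+```shell
+pip install seaborn
+```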
+
+### How to plot the loss/accuracy curve
+
+We present some examples here to show how to plot the loss curve or accuracy curve by using `tools/analysis_tools/analyze_logs.py`.
+
+#### Plot the loss curve in training.
+
+```shell
+python tools/analysis_tools/analyze_logs.py plot_curve your_log_json --keys loss --legend loss
+```
+
+#### Plot the top-1 accuracy and top-5 accuracy curves, and save the figure to results.jpg.
+
+```shell
+python tools/analysis_tools/analyze_logs.py plot_curve your_log_json --keys accuracy/top1 accuracy/top5 --legend top1 top5 --out results.jpg
+```
+
+#### Compare the top-1 accuracy of two log files in the same figure.
+
+```shell
+python tools/analysis_tools/analyze_logs.py plot_curve log1.json log2.json --keys accuracy/top1 --legend exp1 exp2
+```
+
+### How to calculate training time
+
+`tools/analysis_tools/analyze_logs.py` can also calculate the training time according to the log files.
+
+```shell
+python tools/analysis_tools/analyze_logs.py cal_train_time \
+ ${JSON_LOGS}
+ [--include-outliers]
+```
+
+**Description of all arguments**:
+
+- `json_logs` : The paths of the log files, separate multiple files by spaces.
+- `--include-outliers` : If set, include the first time record in each epoch (the time of the first iteration is sometimes longer).
+
+Example:
+
+```shell
+python tools/analysis_tools/analyze_logs.py cal_train_time work_dirs/your_exp/20230206_181002/vis_data/scalars.json
+```
+
+The output is expected to be like the below.
+
+```text
+-----Analyze train time of work_dirs/your_exp/20230206_181002/vis_data/scalars.json-----
+slowest epoch 68, average time is 0.3818
+fastest epoch 1, average time is 0.3694
+time std over epochs is 0.0020
+average iter time: 0.3777 s/iter
+```
+
+## Result Analysis
+
+With the `--out` argument in `tools/test.py`, we can save the inference results of all samples as a file.
+And with this result file, we can do further analysis.
+
+### How to conduct offline metric evaluation
+
+We provide `tools/analysis_tools/eval_metric.py` to enable users to evaluate the model from the prediction files.
+
+```shell
+python tools/analysis_tools/eval_metric.py \
+ ${RESULT} \
+ [--metric ${METRIC_OPTIONS} ...]
+```
+
+Description of all arguments:
+
+- `result`: The output result file in pickle format from `tools/test.py`.
+- `--metric`: The metric and options to evaluate the results. You need to specify at least one metric and you
+ can also specify multiple `--metric` to use multiple metrics.
+
+Please refer to the [Metric Documentation](mmpretrain.evaluation) to find the available metrics and options.
+
+```{note}
+In `tools/test.py`, we support using `--out-item` option to select which kind of results will be saved.
+Please ensure the `--out-item` is not specified or `--out-item=pred` to use this tool.
+```
+
+**Examples**:
+
+```shell
+# Get the prediction results
+python tools/test.py configs/resnet/resnet18_8xb16_cifar10.py \
+ https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth \
+ --out results.pkl
+
+# Eval the top-1 and top-5 accuracy
+python tools/analysis_tools/eval_metric.py results.pkl --metric type=Accuracy topk=1,5
+
+# Eval the overall accuracy and the class-wise precision, recall, f1-score
+python tools/analysis_tools/eval_metric.py results.pkl --metric type=Accuracy \
+ --metric type=SingleLabelMetric items=precision,recall,f1-score average=None
+```
+
+### How to plot the confusion matrix for the test result
+
+We provide `tools/analysis_tools/confusion_matrix.py` to enable users to plot the confusion matrix from the prediction files.
+
+```shell
+python tools/analysis_tools/confusion_matrix.py \
+ ${CONFIG} \
+ ${RESULT} \
+ [--out ${OUT}] \
+ [--show] \
+ [--show-path ${SHOW_PATH}] \
+ [--include-values] \
+    [--cmap ${CMAP}] \
+    [--cfg-options ${CFG_OPTIONS} ...]
+```
+
+Description of all arguments:
+
+- `config`: The config file path.
+- `result`: The output result file in pickle format from `tools/test.py`, or a checkpoint file.
+- `--out`: The path to save the confusion matrix in pickle format.
+- `--show`: Whether to show the confusion matrix plot.
+- `--show-path`: The path to save the confusion matrix plot.
+- `--include-values`: Whether to show the values in the confusion matrix plot.
+- `--cmap`: The color map to plot the confusion matrix.
+- `--cfg-options`: If specified, the key-value pair config will be merged into the config file, for more details please refer to [Learn about Configs](../user_guides/config.md)
+
+```{note}
+In `tools/test.py`, we support using `--out-item` option to select which kind of results will be saved.
+Please ensure the `--out-item` is not specified or `--out-item=pred` to use this tool.
+```
+
+**Examples**:
+
+```shell
+# Get the prediction results
+python tools/test.py configs/resnet/resnet18_8xb16_cifar10.py \
+ https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth \
+ --out results.pkl
+
+# Save the confusion matrix in a pickle file
+python tools/analysis_tools/confusion_matrix.py configs/resnet/resnet18_8xb16_cifar10.py results.pkl --out cm.pkl
+
+# Show the confusion matrix plot in a graphical window.
+python tools/analysis_tools/confusion_matrix.py configs/resnet/resnet18_8xb16_cifar10.py results.pkl --show
+```
+
+### How to visualize the prediction results
+
+We can use `tools/analysis_tools/analyze_results.py` to save the images with the highest scores in successful or failed predictions.
+
+```shell
+python tools/analysis_tools/analyze_results.py \
+ ${CONFIG} \
+ ${RESULT} \
+ [--out-dir ${OUT_DIR}] \
+ [--topk ${TOPK}] \
+ [--rescale-factor ${RESCALE_FACTOR}] \
+ [--cfg-options ${CFG_OPTIONS}]
+```
+
+**Description of all arguments**:
+
+- `config` : The path of the model config file.
+- `result`: Output result file in json/pickle format from `tools/test.py`.
+- `--out-dir`: The directory to store the output files.
+- `--topk`: The number of images in successful or failed predictions with the highest `topk` scores to save. If not specified, it will be set to 20.
+- `--rescale-factor`: The image rescale factor, which is useful if the output is too large or too small (too small
+  images may make the prediction tags hard to read).
+- `--cfg-options`: If specified, the key-value pair config will be merged into the config file, for more details please refer to [Learn about Configs](../user_guides/config.md)
+
+```{note}
+In `tools/test.py`, we support using `--out-item` option to select which kind of results will be saved.
+Please ensure the `--out-item` is not specified or `--out-item=pred` to use this tool.
+```
+
+**Examples**:
+
+```shell
+# Get the prediction results
+python tools/test.py configs/resnet/resnet18_8xb16_cifar10.py \
+ https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth \
+ --out results.pkl
+
+# Save the top-10 successful and failed predictions. And enlarge the sample images by 10 times.
+python tools/analysis_tools/analyze_results.py \
+ configs/resnet/resnet18_8xb16_cifar10.py \
+ results.pkl \
+ --out-dir output \
+ --topk 10 \
+ --rescale-factor 10
+```
diff --git a/docs/en/useful_tools/model_serving.md b/docs/en/useful_tools/model_serving.md
new file mode 100644
index 0000000000000000000000000000000000000000..9f135fbf5c95ba35fc2b794afdaf9b0f0f0c2ec6
--- /dev/null
+++ b/docs/en/useful_tools/model_serving.md
@@ -0,0 +1,88 @@
+# Torchserve Deployment
+
+In order to serve an `MMPretrain` model with [`TorchServe`](https://pytorch.org/serve/), you can follow the steps:
+
+## 1. Convert model from MMPretrain to TorchServe
+
+```shell
+python tools/torchserve/mmpretrain2torchserve.py ${CONFIG_FILE} ${CHECKPOINT_FILE} \
+--output-folder ${MODEL_STORE} \
+--model-name ${MODEL_NAME}
+```
+
+```{note}
+${MODEL_STORE} needs to be an absolute path to a folder.
+```
+
+Example:
+
+```shell
+python tools/torchserve/mmpretrain2torchserve.py \
+ configs/resnet/resnet18_8xb32_in1k.py \
+ checkpoints/resnet18_8xb32_in1k_20210831-fbbb1da6.pth \
+ --output-folder ./checkpoints \
+ --model-name resnet18_in1k
+```
+
+## 2. Build `mmpretrain-serve` docker image
+
+```shell
+docker build -t mmpretrain-serve:latest docker/serve/
+```
+
+## 3. Run `mmpretrain-serve`
+
+Check the official docs for [running TorchServe with docker](https://github.com/pytorch/serve/blob/master/docker/README.md#running-torchserve-in-a-production-docker-environment).
+
+In order to run on a GPU, you need to install [nvidia-docker](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). You can omit the `--gpus` argument to run on the CPU instead.
+
+Example:
+
+```shell
+docker run --rm \
+--name mar \
+--cpus 8 \
+--gpus device=0 \
+-p8080:8080 -p8081:8081 -p8082:8082 \
+--mount type=bind,source=`realpath ./checkpoints`,target=/home/model-server/model-store \
+mmpretrain-serve:latest
+```
+
+```{note}
+`realpath ./checkpoints` points to the absolute path of "./checkpoints", and you can replace it with the absolute path where you store torchserve models.
+```
+
+[Read the docs](https://github.com/pytorch/serve/blob/master/docs/rest_api.md) about the Inference (8080), Management (8081) and Metrics (8082) APIs.
+
+## 4. Test deployment
+
+```shell
+curl http://127.0.0.1:8080/predictions/${MODEL_NAME} -T demo/demo.JPEG
+```
+
+You should obtain a response similar to:
+
+```json
+{
+ "pred_label": 58,
+ "pred_score": 0.38102269172668457,
+ "pred_class": "water snake"
+}
+```
+
+You can use `test_torchserver.py` to compare the results of TorchServe and PyTorch, and visualize them.
+
+```shell
+python tools/torchserve/test_torchserver.py ${IMAGE_FILE} ${CONFIG_FILE} ${CHECKPOINT_FILE} ${MODEL_NAME}
+[--inference-addr ${INFERENCE_ADDR}] [--device ${DEVICE}]
+```
+
+Example:
+
+```shell
+python tools/torchserve/test_torchserver.py \
+ demo/demo.JPEG \
+ configs/resnet/resnet18_8xb32_in1k.py \
+ checkpoints/resnet18_8xb32_in1k_20210831-fbbb1da6.pth \
+ resnet18_in1k
+```
diff --git a/docs/en/useful_tools/print_config.md b/docs/en/useful_tools/print_config.md
new file mode 100644
index 0000000000000000000000000000000000000000..ea4076475b4fdf1ee6f158e49b115abeabf2336c
--- /dev/null
+++ b/docs/en/useful_tools/print_config.md
@@ -0,0 +1,27 @@
+# How to Get the Complete Config
+
+We also provide the `print_config.py` tool to print the complete configuration of the given experiment.
+You can check each item of the config before training by using the following commands.
+
+## Description
+
+`tools/misc/print_config.py` prints the whole config verbatim, expanding all its imports.
+
+```shell
+python tools/misc/print_config.py ${CONFIG} [--cfg-options ${CFG_OPTIONS}]
+```
+
+Description of all arguments:
+
+- `config` : The path of the model config file.
+- `--cfg-options`: If specified, the key-value pair config will be merged into the config file, for more details please refer to [Learn about Configs](../user_guides/config.md)
+
+## Examples
+
+```shell
+# Print a complete config
+python tools/misc/print_config.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py
+
+# Save the complete config to an independent config file.
+python tools/misc/print_config.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py > final_config.py
+```
diff --git a/docs/en/useful_tools/scheduler_visualization.md b/docs/en/useful_tools/scheduler_visualization.md
new file mode 100644
index 0000000000000000000000000000000000000000..0ba1bdc4ff96d678985522b17dd539dc0964f1a9
--- /dev/null
+++ b/docs/en/useful_tools/scheduler_visualization.md
@@ -0,0 +1,44 @@
+# Hyper-parameter Scheduler Visualization
+
+This tool aims to help the user check the hyper-parameter scheduler of the optimizer (without training), which supports the "learning rate" and "momentum" parameters.
+
+## Introduce the scheduler visualization tool
+
+```bash
+python tools/visualization/vis_scheduler.py \
+ ${CONFIG_FILE} \
+ [-p, --parameter ${PARAMETER_NAME}] \
+ [-d, --dataset-size ${DATASET_SIZE}] \
+ [-n, --ngpus ${NUM_GPUs}] \
+ [-s, --save-path ${SAVE_PATH}] \
+ [--title ${TITLE}] \
+ [--style ${STYLE}] \
+ [--window-size ${WINDOW_SIZE}] \
+ [--cfg-options]
+```
+
+**Description of all arguments**:
+
+- `config`: The path of a model config file.
+- **`-p, --parameter`**: The parameter whose change curve is visualized, chosen from "lr" and "momentum". Defaults to "lr".
+- **`-d, --dataset-size`**: The size of the dataset. If set, `build_dataset` will be skipped and `${DATASET_SIZE}` will be used as the size. Defaults to using the function `build_dataset`.
+- **`-n, --ngpus`**: The number of GPUs used in training. Defaults to 1.
+- **`-s, --save-path`**: The path to save the learning rate curve plot. By default, the plot is not saved.
+- `--title`: The title of the figure. If not set, defaults to the config file name.
+- `--style`: The style of the plot. If not set, defaults to `whitegrid`.
+- `--window-size`: The shape of the display window. If not specified, it will be set to `12*7`. If used, it must be in the format `'W*H'`.
+- `--cfg-options`: Modifications to the configuration file, refer to [Learn about Configs](../user_guides/config.md).
+
+```{note}
+Loading annotations may consume much time; you can directly specify the size of the dataset with `-d, --dataset-size` to save time.
+```
+
+## How to plot the learning rate curve without training
+
+You can use the following command to plot the step learning rate schedule used in the config `configs/swin_transformer/swin-base_16xb64_in1k.py`:
+
+```bash
+python tools/visualization/vis_scheduler.py configs/swin_transformer/swin-base_16xb64_in1k.py --dataset-size 1281167 --ngpus 16
+```
+
+
diff --git a/docs/en/useful_tools/shape_bias.md b/docs/en/useful_tools/shape_bias.md
new file mode 100644
index 0000000000000000000000000000000000000000..907bde61ee7f1d86e839b2b32c694c3270a2298a
--- /dev/null
+++ b/docs/en/useful_tools/shape_bias.md
@@ -0,0 +1,100 @@
+# Shape Bias Tool Usage
+
+Shape bias measures how much a model relies on shapes, compared to textures, to perceive the semantics in images. For more details,
+we refer interested readers to this [paper](https://arxiv.org/abs/2106.07411). MMPretrain provides an off-the-shelf toolbox to
+obtain the shape bias of a classification model. You can follow the steps below:
+
+## Prepare the dataset
+
+First you should download the [cue-conflict](https://github.com/bethgelab/model-vs-human/releases/download/v0.1/cue-conflict.tar.gz) dataset to the `data` folder,
+and then unzip it. After that, your `data` folder should have the following structure:
+
+```text
+data
+├──cue-conflict
+| |──airplane
+| |──bear
+| ...
+| |── truck
+```
+
+## Modify the config for classification
+
+We run the shape-bias tool on a ViT-base model with masked autoencoder pretraining. Its config file is `configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py`, and its checkpoint is downloaded from [this link](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.pth). Replace the original `test_pipeline`, `test_dataloader` and `test_evaluator` with the following configurations:
+
+```python
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+test_dataloader = dict(
+ pin_memory=True,
+ collate_fn=dict(type='default_collate'),
+ batch_size=32,
+ num_workers=4,
+ dataset=dict(
+ type='CustomDataset',
+ data_root='data/cue-conflict',
+ pipeline=test_pipeline,
+ _delete_=True),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+ drop_last=False)
+test_evaluator = dict(
+ type='mmpretrain.ShapeBiasMetric',
+ _delete_=True,
+ csv_dir='work_dirs/shape_bias',
+ model_name='mae')
+```
+
+Please note that you should customize the `csv_dir` and `model_name` above. In this example, the modified config file is saved as `vit-base-p16_8xb128-coslr-100e_in1k_shape-bias.py` in the folder `configs/mae/benchmarks/`.
+
+## Inference your model with the modified config file
+
+Then you should run inference on the `cue-conflict` dataset with your modified config file.
+
+```shell
+# For PyTorch
+bash tools/dist_test.sh $CONFIG $CHECKPOINT $GPUS
+```
+
+**Description of all arguments**:
+
+- `$CONFIG`: The path of your modified config file.
+- `$CHECKPOINT`: The path or link of the checkpoint file.
+- `$GPUS`: The number of GPUs to use for the test.
+
+```shell
+# Example
+bash tools/dist_test.sh configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k_shape-bias.py https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.pth 1
+```
+
+After that, you should obtain a csv file in the `csv_dir` folder, named `cue-conflict_model-name_session-1.csv`. Besides this file, you should also download these [csv files](https://github.com/bethgelab/model-vs-human/tree/master/raw-data/cue-conflict) to the
+`csv_dir`.
+
+## Plot shape bias
+
+Then we can start to plot the shape bias:
+
+```shell
+python tools/analysis_tools/shape_bias.py --csv-dir $CSV_DIR --result-dir $RESULT_DIR --colors $RGB --markers o --plotting-names $YOUR_MODEL_NAME --model-names $YOUR_MODEL_NAME
+```
+
+**Description of all arguments**:
+
+- `--csv-dir $CSV_DIR`: the same directory where these csv files are saved.
+- `--result-dir $RESULT_DIR`: the directory to output the result named `cue-conflict_shape-bias_matrixplot.pdf`.
+- `--colors $RGB`: the RGB values, formatted as `R G B`, e.g. `100 100 100`; multiple RGB values can be given if you want to plot the shape bias of several models.
+- `--plotting-names $YOUR_MODEL_NAME`: the name of the legend in the shape bias figure, which you can set to your model name. If you want to plot several models, `--plotting-names` can take multiple values.
+- `--model-names $YOUR_MODEL_NAME`: should be the same name as specified in your config, and can be multiple names if you want to plot the shape bias of several models.
+
+Please note that every three values for `--colors` correspond to one value for `--model-names`. After all of the above steps, you are expected to obtain the following figure.
+
+
diff --git a/docs/en/useful_tools/t-sne_visualization.md b/docs/en/useful_tools/t-sne_visualization.md
new file mode 100644
index 0000000000000000000000000000000000000000..9f24a114dbe6e70a1d3b7beae6f2c98967008113
--- /dev/null
+++ b/docs/en/useful_tools/t-sne_visualization.md
@@ -0,0 +1,85 @@
+# t-Distributed Stochastic Neighbor Embedding (t-SNE) Visualization
+
+## Introduction of the t-SNE visualization tool
+
+MMPretrain provides the `tools/visualization/vis_tsne.py` tool to visualize the feature embeddings of images by t-SNE. Please install `scikit-learn` (`pip install scikit-learn`) to calculate t-SNE.
+
+**Command**:
+
+```bash
+python tools/visualization/vis_tsne.py \
+ CONFIG \
+ [--checkpoint CHECKPOINT] \
+ [--work-dir WORK_DIR] \
+ [--test-cfg TEST_CFG] \
+    [--vis-stage {backbone,neck,pre_logits}] \
+    [--class-idx ${CLASS_IDX} [CLASS_IDX ...]] \
+    [--max-num-class MAX_NUM_CLASS] \
+    [--max-num-samples MAX_NUM_SAMPLES] \
+    [--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]] \
+    [--device DEVICE] \
+    [--legend] \
+    [--show] \
+    [--n-components N_COMPONENTS] \
+    [--perplexity PERPLEXITY] \
+    [--early-exaggeration EARLY_EXAGGERATION] \
+    [--learning-rate LEARNING_RATE] \
+    [--n-iter N_ITER] \
+    [--n-iter-without-progress N_ITER_WITHOUT_PROGRESS] \
+    [--init INIT]
+```
+
+**Description of all arguments**:
+
+- `CONFIG`: The path of t-SNE config file.
+- `--checkpoint CHECKPOINT`: The path of the checkpoint file.
+- `--work-dir WORK_DIR`: The directory to save logs and visualization images.
+- `--test-cfg TEST_CFG`: The path of t-SNE config file to load config of test dataloader.
+- `--vis-stage {backbone,neck,pre_logits}`: The visualization stage of the model.
+- `--class-idx CLASS_IDX [CLASS_IDX ...]`: The categories used to calculate t-SNE.
+- `--max-num-class MAX_NUM_CLASS`: The first N categories to apply t-SNE algorithms. Defaults to 20.
+- `--max-num-samples MAX_NUM_SAMPLES`: The maximum number of samples per category. A higher number needs a longer time to calculate. Defaults to 100.
+- `--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]`: Override some settings in the used config, the key-value pair in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be like key="[a,b]" or key=a,b. It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]". Note that the quotation marks are necessary and that no white space is allowed.
+- `--device DEVICE`: Device used for inference.
+- `--legend`: Show the legend of all categories.
+- `--show`: Display the result in a graphical window.
+- `--n-components N_COMPONENTS`: The dimension of results.
+- `--perplexity PERPLEXITY`: The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms.
+- `--early-exaggeration EARLY_EXAGGERATION`: Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them.
+- `--learning-rate LEARNING_RATE`: The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If the learning rate is too high, the data may look like a ball with any point approximately equidistant from its nearest neighbours. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers.
+- `--n-iter N_ITER`: Maximum number of iterations for the optimization. Should be at least 250.
+- `--n-iter-without-progress N_ITER_WITHOUT_PROGRESS`: Maximum number of iterations without progress before we abort the optimization.
+- `--init INIT`: The init method.
+
+## How to visualize the t-SNE of a image classifier (such as ResNet)
+
+Here are two examples of running t-SNE visualization on ResNet-18 and ResNet-50 models, trained on the CIFAR-10 dataset:
+
+```shell
+python tools/visualization/vis_tsne.py \
+ configs/resnet/resnet18_8xb16_cifar10.py \
+ --checkpoint https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth
+
+python tools/visualization/vis_tsne.py \
+ configs/resnet/resnet50_8xb16_cifar10.py \
+ --checkpoint https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth
+```
+
+| ResNet-18 | ResNet-50 |
+| ---------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
+| | |
+
+## How to visualize the t-SNE of a self-supervised model (such as MAE)
+
+Here is an example of running t-SNE visualization on the MAE ViT-base model, trained on the ImageNet dataset. The input data comes from the ImageNet validation set. MAE and some other self-supervised pre-training algorithms do not have `test_dataloader` information in their configs. When analyzing such self-supervised algorithms, you need to add the `test_dataloader` information to the config, or you can use the `--test-cfg` argument to specify a config file.
+
+```shell
+python tools/visualization/vis_tsne.py \
+ configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py \
+ --checkpoint https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth \
+ --test-cfg configs/_base_/datasets/imagenet_bs32.py
+```
+
+| MAE-ViT-base |
+| ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| |
diff --git a/docs/en/useful_tools/verify_dataset.md b/docs/en/useful_tools/verify_dataset.md
new file mode 100644
index 0000000000000000000000000000000000000000..d27948f44b9980bf76bd6a582a13219667c4e683
--- /dev/null
+++ b/docs/en/useful_tools/verify_dataset.md
@@ -0,0 +1,28 @@
+# Verify Dataset
+
+In MMPretrain, we also provide a tool `tools/misc/verify_dataset.py` to check whether there are **broken pictures** in the given dataset.
+
+## Introduce the tool
+
+```shell
+python tools/misc/verify_dataset.py \
+    ${CONFIG} \
+    [--out-path ${OUT-PATH}] \
+    [--phase ${PHASE}] \
+    [--num-process ${NUM-PROCESS}] \
+ [--cfg-options ${CFG_OPTIONS}]
+```
+
+**Description of all arguments**:
+
+- `config` : The path of the model config file.
+- `--out-path` : The path to save the verification result. If not set, defaults to 'brokenfiles.log'.
+- `--phase` : The phase of the dataset to verify, accepts "train", "test" and "val". If not set, defaults to "train".
+- `--num-process` : The number of processes to use. If not set, defaults to 1.
+- `--cfg-options`: If specified, the key-value pair config will be merged into the config file, for more details please refer to [Learn about Configs](../user_guides/config.md)
+
+## Example
+
+```shell
+python tools/misc/verify_dataset.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py --out-path broken_imgs.log --phase val --num-process 8
+```
diff --git a/docs/en/user_guides/config.md b/docs/en/user_guides/config.md
new file mode 100644
index 0000000000000000000000000000000000000000..6077c707df0d87a25f746b7265ce6e0a4eec92d8
--- /dev/null
+++ b/docs/en/user_guides/config.md
@@ -0,0 +1,421 @@
+# Learn about Configs
+
+To manage various configurations in a deep-learning experiment, we use a kind of config file to record all of
+these configurations. This config system has a modular and inheritance design, and more details can be found in
+{external+mmengine:doc}`the tutorial in MMEngine `.
+
+Usually, we use Python files as config files. All configuration files are placed under the [`configs`](https://github.com/open-mmlab/mmpretrain/tree/main/configs) folder, and the directory structure is as follows:
+
+```text
+MMPretrain/
+ ├── configs/
+ │ ├── _base_/ # primitive configuration folder
+ │ │ ├── datasets/ # primitive datasets
+ │ │ ├── models/ # primitive models
+ │ │ ├── schedules/ # primitive schedules
+ │ │ └── default_runtime.py # primitive runtime setting
+ │ ├── beit/ # BEiT Algorithms Folder
+ │ ├── mae/ # MAE Algorithms Folder
+ │ ├── mocov2/ # MoCoV2 Algorithms Folder
+ │ ├── resnet/ # ResNet Algorithms Folder
+ │ ├── swin_transformer/ # Swin Algorithms Folder
+ │ ├── vision_transformer/ # ViT Algorithms Folder
+ │ ├── ...
+ └── ...
+```
+
+If you wish to inspect the config file, you may run `python tools/misc/print_config.py /PATH/TO/CONFIG` to see the complete config.
+
+This article mainly explains the structure of configuration files, and how to modify them based on existing configuration files. We will take the [ResNet50 config file](https://github.com/open-mmlab/mmpretrain/blob/main/configs/resnet/resnet50_8xb32_in1k.py) as an example and explain it line by line.
+
+## Config Structure
+
+There are four kinds of basic component files in the `configs/_base_` folders, namely:
+
+- [models](https://github.com/open-mmlab/mmpretrain/tree/main/configs/_base_/models)
+- [datasets](https://github.com/open-mmlab/mmpretrain/tree/main/configs/_base_/datasets)
+- [schedules](https://github.com/open-mmlab/mmpretrain/tree/main/configs/_base_/schedules)
+- [runtime](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/default_runtime.py)
+
+We call the config files in the `_base_` folder _primitive_ config files. You can easily build your training config file by inheriting some primitive config files.
+
+For easy understanding, we use [ResNet50 config file](https://github.com/open-mmlab/mmpretrain/blob/main/configs/resnet/resnet50_8xb32_in1k.py) as an example and comment on each line.
+
+```python
+_base_ = [ # This config file will inherit all config files in `_base_`.
+ '../_base_/models/resnet50.py', # model settings
+ '../_base_/datasets/imagenet_bs32.py', # data settings
+ '../_base_/schedules/imagenet_bs256.py', # schedule settings
+ '../_base_/default_runtime.py' # runtime settings
+]
+```
+
+We will explain the four primitive config files separately below.
+
+### Model settings
+
+This primitive config file includes a dict variable `model`, which mainly includes information such as network structure and loss function:
+
+- `type`: The type of model to build; we support several tasks.
+ - For image classification tasks, it's usually `ImageClassifier`. You can find more details in the [API documentation](mmpretrain.models.classifiers).
+ - For self-supervised learning, there are several `SelfSupervisors`, such as `MoCoV2`, `BEiT`, `MAE`, etc. You can find more details in the [API documentation](mmpretrain.models.selfsup).
+ - For image retrieval tasks, it's usually `ImageToImageRetriever`. You can find more details in the [API documentation](mmpretrain.models.retrievers).
+
+Usually, we use the **`type` field** to specify the class of the component and use other fields to pass
+the initialization arguments of the class. The {external+mmengine:doc}`registry tutorial ` describes it in detail.
+
+Here, we use the config fields of [`ImageClassifier`](mmpretrain.models.classifiers.ImageClassifier) as an example to
+describe the initialization arguments as below:
+
+- `backbone`: The settings of the backbone. The backbone is the main network to extract features of the inputs, like `ResNet`, `Swin Transformer`, `Vision Transformer` etc. All available backbones can be found in the [API documentation](mmpretrain.models.backbones).
+ - For self-supervised learning, some of the backbones are re-implemented; you can find more details in the [API documentation](mmpretrain.models.selfsup).
+- `neck`: The settings of the neck. The neck is the intermediate module to connect the backbone and the head, like `GlobalAveragePooling`. All available necks can be found in the [API documentation](mmpretrain.models.necks).
+- `head`: The settings of the task head. The head is the task-related component to do a specified task, like image classification or self-supervised training. All available heads can be found in the [API documentation](mmpretrain.models.heads).
+ - `loss`: The loss function to optimize, like `CrossEntropyLoss`, `LabelSmoothLoss`, `PixelReconstructionLoss`, etc. All available losses can be found in the [API documentation](mmpretrain.models.losses).
+- `data_preprocessor`: The component before the model forwarding to preprocess the inputs. See the [documentation](mmpretrain.models.utils.data_preprocessor) for more details.
+- `train_cfg`: The extra settings of `ImageClassifier` during training. In `ImageClassifier`, we mainly use it to specify batch augmentation settings, like `Mixup` and `CutMix`. See the [documentation](mmpretrain.models.utils.batch_augments) for more details.
+
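+For example, here is a minimal sketch (not taken from an existing config file; the `alpha` values are
+illustrative) of enabling `Mixup` and `CutMix` batch augmentations through `train_cfg`:
+
+```python
+model = dict(
+    type='ImageClassifier',
+    backbone=...,   # backbone, neck and head settings are omitted here
+    neck=...,
+    head=...,
+    # Randomly pick one of the listed batch augmentations for every training batch.
+    train_cfg=dict(augments=[
+        dict(type='Mixup', alpha=0.2),
+        dict(type='CutMix', alpha=1.0),
+    ]),
+)
+```
+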
+Following is the model primitive config of the ResNet50 config file in [`configs/_base_/models/resnet50.py`](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/models/resnet50.py):
+
+```python
+model = dict(
+ type='ImageClassifier', # The type of the main model (here is for image classification task).
+ backbone=dict(
+ type='ResNet', # The type of the backbone module.
+ # All fields except `type` come from the __init__ method of class `ResNet`
+ # and you can find them from https://mmpretrain.readthedocs.io/en/latest/api/generated/mmpretrain.models.backbones.ResNet.html
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ frozen_stages=-1,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'), # The type of the neck module.
+ head=dict(
+ type='LinearClsHead', # The type of the classification head module.
+ # All fields except `type` come from the __init__ method of class `LinearClsHead`
+ # and you can find them from https://mmpretrain.readthedocs.io/en/latest/api/generated/mmpretrain.models.heads.LinearClsHead.html
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+```
+
+### Data settings
+
+This primitive config file includes information to construct the dataloader and evaluator:
+
+- `data_preprocessor`: Model input preprocessing configuration, same as `model.data_preprocessor` but with lower priority.
+- `train_evaluator | val_evaluator | test_evaluator`: To build the evaluator or metrics, refer to the [tutorial](mmpretrain.evaluation).
+- `train_dataloader | val_dataloader | test_dataloader`: The settings of dataloaders
+ - `batch_size`: The batch size of each GPU.
+ - `num_workers`: The number of workers to fetch data per GPU.
+ - `sampler`: The settings of the sampler.
+ - `persistent_workers`: Whether to keep the worker processes alive after finishing one epoch.
+ - `dataset`: The settings of the dataset.
+ - `type`: The type of the dataset, we support `CustomDataset`, `ImageNet` and many other datasets, refer to [documentation](mmpretrain.datasets).
+ - `pipeline`: The data transform pipeline. You can find how to design a pipeline in [this tutorial](https://mmpretrain.readthedocs.io/en/latest/tutorials/data_pipeline.html).
+
+Following is the data primitive config of the ResNet50 config in [`configs/_base_/datasets/imagenet_bs32.py`](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/datasets/imagenet_bs32.py):
+
+```python
+dataset_type = 'ImageNet'
+# preprocessing configuration
+data_preprocessor = dict(
+ # Input image data channels in 'RGB' order
+ mean=[123.675, 116.28, 103.53], # Input image normalized channel mean in RGB order
+ std=[58.395, 57.12, 57.375], # Input image normalized channel std in RGB order
+ to_rgb=True, # Whether to flip the channel from BGR to RGB or RGB to BGR
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'), # read image
+ dict(type='RandomResizedCrop', scale=224), # Random scaling and cropping
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'), # random horizontal flip
+ dict(type='PackInputs'), # prepare images and labels
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'), # read image
+ dict(type='ResizeEdge', scale=256, edge='short'), # Scale the short side to 256
+ dict(type='CenterCrop', crop_size=224), # center crop
+ dict(type='PackInputs'), # prepare images and labels
+]
+
+# Construct training set dataloader
+train_dataloader = dict(
+ batch_size=32, # batch size per GPU
+ num_workers=5, # Number of workers to fetch data per GPU
+ dataset=dict( # training dataset
+ type=dataset_type,
+ data_root='data/imagenet',
+ ann_file='meta/train.txt',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True), # default sampler
+ persistent_workers=True, # Whether to keep the worker processes alive, which can shorten the preparation time of each epoch
+)
+
+# Construct the validation set dataloader
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ ann_file='meta/val.txt',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+# The settings of the evaluation metrics for validation. We use the top1 and top5 accuracy here.
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+test_dataloader = val_dataloader # The settings of the dataloader for the test dataset, which is the same as val_dataloader
+test_evaluator = val_evaluator # The settings of the evaluation metrics for test, which is the same as val_evaluator
+```
+
+```{note}
+The data preprocessor can be defined either in the `model.data_preprocessor` field, or using the standalone `data_preprocessor` definition here. If both of them exist, the `model.data_preprocessor` configuration is used.
+```
+
+### Schedule settings
+
+This primitive config file mainly contains training strategy settings and the settings of training, val and
+test loops:
+
+- `optim_wrapper`: The settings of the optimizer wrapper. We use the optimizer wrapper to customize the
+ optimization process.
+ - `optimizer`: Supports all `pytorch` optimizers, refers to the relevant {external+mmengine:doc}`MMEngine documentation `.
+ - `paramwise_cfg`: To set different optimization arguments according to the parameters' type or name, refer to the relevant [learning policy documentation](../advanced_guides/schedule.md).
+ - `accumulative_counts`: Optimize parameters after several backward steps instead of one backward step. You
+ can use it to simulate large batch size by small batch size.
+- `param_scheduler`: Optimizer parameters policy. You can use it to specify learning rate and momentum curves during training. See the {external+mmengine:doc}`documentation ` in MMEngine for more details.
+- `train_cfg | val_cfg | test_cfg`: The settings of the training, validation and test loops, refer to the relevant {external+mmengine:doc}`MMEngine documentation `.
+
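+For instance, here is a minimal sketch (the values are illustrative and not from an existing config) of an
+optimizer wrapper that uses gradient accumulation and parameter-wise settings, together with a warm-up plus
+cosine decay schedule:
+
+```python
+optim_wrapper = dict(
+    optimizer=dict(type='AdamW', lr=1e-3, weight_decay=0.05),
+    # Accumulate gradients of 4 iterations to simulate a 4x larger total batch size.
+    accumulative_counts=4,
+    # Disable weight decay for normalization layers and bias parameters.
+    paramwise_cfg=dict(norm_decay_mult=0.0, bias_decay_mult=0.0),
+)
+
+# Multiple schedulers can be combined: a 5-epoch linear warm-up followed by cosine decay.
+param_scheduler = [
+    dict(type='LinearLR', start_factor=0.01, by_epoch=True, begin=0, end=5),
+    dict(type='CosineAnnealingLR', T_max=95, by_epoch=True, begin=5, end=100),
+]
+```
+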
+Following is the schedule primitive config of the ResNet50 config in [`configs/_base_/schedules/imagenet_bs256.py`](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/schedules/imagenet_bs256.py):
+
+```python
+optim_wrapper = dict(
+ # Use SGD optimizer to optimize parameters.
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+
+# The tuning strategy of the learning rate.
+# The 'MultiStepLR' means to use multiple steps policy to schedule the learning rate (LR).
+param_scheduler = dict(
+ type='MultiStepLR', by_epoch=True, milestones=[30, 60, 90], gamma=0.1)
+
+# Training configuration: train for 100 epochs, and perform validation after every training epoch.
+# 'by_epoch=True' means to use `EpochBasedTrainLoop`, 'by_epoch=False' means to use `IterBasedTrainLoop`.
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+# Use the default val loop settings.
+val_cfg = dict()
+# Use the default test loop settings.
+test_cfg = dict()
+
+# This schedule is for a total batch size of 256.
+# If you use a different total batch size, like 512, and enable automatic learning rate scaling,
+# the learning rate will be scaled up by 2 times.
+auto_scale_lr = dict(base_batch_size=256)
+```
+
+### Runtime settings
+
+This part mainly includes the checkpoint saving strategy, log configuration, training parameters, resume checkpoint path, working directory, etc.
+
+Here is the runtime primitive config file ['configs/_base_/default_runtime.py'](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/default_runtime.py), which is used by almost all configs:
+
+```python
+# defaults to use registries in mmpretrain
+default_scope = 'mmpretrain'
+
+# configure default hooks
+default_hooks = dict(
+ # record the time of every iteration.
+ timer=dict(type='IterTimerHook'),
+
+ # print log every 100 iterations.
+ logger=dict(type='LoggerHook', interval=100),
+
+ # enable the parameter scheduler.
+ param_scheduler=dict(type='ParamSchedulerHook'),
+
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1),
+
+ # set sampler seed in a distributed environment.
+ sampler_seed=dict(type='DistSamplerSeedHook'),
+
+ # validation results visualization, set True to enable it.
+ visualization=dict(type='VisualizationHook', enable=False),
+)
+
+# configure environment
+env_cfg = dict(
+ # whether to enable cudnn benchmark
+ cudnn_benchmark=False,
+
+ # set multi-process parameters
+ mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
+
+ # set distributed parameters
+ dist_cfg=dict(backend='nccl'),
+)
+
+# set visualizer
+vis_backends = [dict(type='LocalVisBackend')] # use local HDD backend
+visualizer = dict(
+ type='UniversalVisualizer', vis_backends=vis_backends, name='visualizer')
+
+# set log level
+log_level = 'INFO'
+
+# load from which checkpoint
+load_from = None
+
+# whether to resume training from the loaded checkpoint
+resume = False
+```
+
+## Inherit and Modify Config File
+
+For easy understanding, we recommend contributors inherit from existing config files. But do not abuse the
+inheritance. Usually, for all config files, we recommend a maximum inheritance level of 3.
+
+For example, if your config file is based on ResNet with some other modification, you can first inherit the
+basic ResNet structure, dataset and other training settings by specifying `_base_ ='./resnet50_8xb32_in1k.py'`
+(the path is relative to your config file), and then modify the necessary parameters in the new config file. For a
+more specific example: suppose we want to reuse almost all of the settings in `configs/resnet/resnet50_8xb32_in1k.py`,
+but use the `CutMix` train batch augment, change the number of training epochs from 100 to 300, adjust when to decay
+the learning rate, and modify the dataset path. We can create a new config file
+`configs/resnet/resnet50_8xb32-300e_in1k.py` with the content below:
+
+```python
+# create this file under 'configs/resnet/' folder
+_base_ = './resnet50_8xb32_in1k.py'
+
+# using CutMix batch augment
+model = dict(
+ train_cfg=dict(
+ augments=dict(type='CutMix', alpha=1.0)
+ )
+)
+
+# trains more epochs
+train_cfg = dict(max_epochs=300, val_interval=10) # Train for 300 epochs, evaluate every 10 epochs
+param_scheduler = dict(milestones=[150, 200, 250]) # The learning rate decay milestones have also changed
+
+# Use your own dataset directory
+train_dataloader = dict(
+ dataset=dict(data_root='mydata/imagenet/train'),
+)
+val_dataloader = dict(
+ batch_size=64, # No back-propagation during validation, larger batch size can be used
+ dataset=dict(data_root='mydata/imagenet/val'),
+)
+test_dataloader = dict(
+ batch_size=64, # No back-propagation during test, larger batch size can be used
+ dataset=dict(data_root='mydata/imagenet/val'),
+)
+```
+
+### Use intermediate variables in configs
+
+Some intermediate variables are used in the configuration file. The intermediate variables make the configuration file clearer and easier to modify.
+
+For example, `train_pipeline` / `test_pipeline` is the intermediate variable of the data pipeline. We first need to define `train_pipeline` / `test_pipeline`, and then pass them to `train_dataloader` / `test_dataloader`. If you want to modify the size of the input image during training and testing, you need to modify the intermediate variables of `train_pipeline` / `test_pipeline`.
+
+```python
+bgr_mean = [103.53, 116.28, 123.675] # mean in BGR order
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow', interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=6,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=236, edge='short', backend='pillow', interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+```
+
+### Ignore some fields in the base configs
+
+Sometimes, you need to set `_delete_=True` to ignore some of the fields in the base configuration file. You can refer to the {external+mmengine:doc}`documentation in MMEngine ` for more instructions.
+
+The following is an example. If you want to use a cosine schedule in the above ResNet50 case, simply inheriting and modifying the config will report an error like `got an unexpected keyword argument 'milestones'`, because the `milestones` field of `param_scheduler` in the base config is kept. You need to add `_delete_=True` to ignore the original `param_scheduler` fields in the base configuration file:
+
+```python
+_base_ = '../../configs/resnet/resnet50_8xb32_in1k.py'
+
+# the learning rate scheduler
+param_scheduler = dict(type='CosineAnnealingLR', by_epoch=True, _delete_=True)
+```
+
+### Use some fields in the base configs
+
+Sometimes, you may refer to some fields in the `_base_` config, to avoid duplication of definitions. You can refer to {external+mmengine:doc}`MMEngine ` for some more instructions.
+
+The following is an example of using auto augment in the training data preprocessing pipeline, refer to [`configs/resnest/resnest50_32xb64_in1k.py`](https://github.com/open-mmlab/mmpretrain/blob/main/configs/resnest/resnest50_32xb64_in1k.py). When defining `train_pipeline`, just add the file that defines the auto augment policies to `_base_`, and then use `_base_.policies` to reference the variable in the primitive config:
+
+```python
+_base_ = [
+ '../_base_/models/resnest50.py', '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/default_runtime.py', './_randaug_policies.py',
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandAugment',
+ policies=_base_.policies, # This uses the `policies` parameter in the primitive config.
+ num_policies=2,
+ magnitude_level=12),
+ dict(type='EfficientNetRandomCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='Lighting',
+ eigval=EIGVAL,
+ eigvec=EIGVEC,
+ alphastd=0.1,
+ to_rgb=False),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+```
+
+## Modify config in command
+
+When you use the scripts `tools/train.py` or `tools/test.py` to submit tasks, or use some other tools, you can directly modify the content of the configuration file used by specifying the `--cfg-options` argument. A combined sketch is given after the list below.
+
+- Update config keys of dict chains.
+
+ The config options can be specified following the order of the dict keys in the original config.
+ For example, `--cfg-options model.backbone.norm_eval=False` changes all the BN modules in the model backbone to `train` mode.
+
+- Update keys inside a list of configs.
+
+ Some config dicts are composed as a list in your config. For example, the training pipeline `train_dataloader.dataset.pipeline` is normally a list,
+ e.g. `[dict(type='LoadImageFromFile'), dict(type='RandomFlip', prob=0.5, direction='horizontal'), ...]`. If you want to change `prob=0.5` to `prob=0.0` in the pipeline,
+ you may specify `--cfg-options train_dataloader.dataset.pipeline.1.prob=0.0`.
+
+- Update values of list/tuples.
+
+ If the value to be updated is a list or a tuple, quotation marks are needed. For example, the config file normally sets `val_evaluator = dict(type='Accuracy', topk=(1, 5))`. If you want to change the field `topk`, you may specify `--cfg-options val_evaluator.topk="(1,3)"`. Note that the quotation mark " is necessary to support list/tuple data types and that **NO** white space is allowed inside the quotation marks in the specified value.
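+
+Putting the three cases above together, here is a small sketch of what such overrides do, expressed with
+MMEngine's `Config` API (the config path is just an example, and this assumes the behaviour of
+`Config.merge_from_dict`):
+
+```python
+from mmengine.config import Config
+
+cfg = Config.fromfile('configs/resnet/resnet50_8xb32_in1k.py')
+# Equivalent to passing the same key-value pairs through `--cfg-options`.
+cfg.merge_from_dict(
+    {
+        'model.backbone.norm_eval': False,                 # key of a dict chain
+        'train_dataloader.dataset.pipeline.1.prob': 0.0,   # key inside a list of configs
+        'val_evaluator.topk': (1, 3),                      # tuple value
+    },
+    allow_list_keys=True,  # let numeric indices address list elements
+)
+```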
diff --git a/docs/en/user_guides/dataset_prepare.md b/docs/en/user_guides/dataset_prepare.md
new file mode 100644
index 0000000000000000000000000000000000000000..17ec229b86693ce50f3c989f45a556da5b696260
--- /dev/null
+++ b/docs/en/user_guides/dataset_prepare.md
@@ -0,0 +1,364 @@
+# Prepare Dataset
+
+## CustomDataset
+
+[`CustomDataset`](mmpretrain.datasets.CustomDataset) is a general dataset class for you to use your own datasets. To use `CustomDataset`, you need to organize your dataset files according to the following two formats:
+
+### Subfolder Format
+
+In this format, you only need to re-organize your dataset folder and place all samples in one folder without
+creating any annotation files.
+
+For supervised tasks (with `with_label=True`), we use the names of the sub-folders as the category names. As
+shown in the example below, `class_x` and `class_y` will be recognized as the category names.
+
+```text
+data_prefix/
+├── class_x
+│ ├── xxx.png
+│ ├── xxy.png
+│ └── ...
+│ └── xxz.png
+└── class_y
+ ├── 123.png
+ ├── nsdf3.png
+ ├── ...
+ └── asd932_.png
+```
+
+For unsupervised tasks (with `with_label=False`), we directly load all sample files under the specified folder:
+
+```text
+data_prefix/
+├── folder_1
+│ ├── xxx.png
+│ ├── xxy.png
+│ └── ...
+├── 123.png
+├── nsdf3.png
+└── ...
+```
+
+Assume you want to use it as the training dataset; below are the corresponding configurations in your config file.
+
+```python
+train_dataloader = dict(
+ ...
+ # Training dataset configurations
+ dataset=dict(
+ type='CustomDataset',
+ data_prefix='path/to/data_prefix',
+ with_label=True, # or False for unsupervised tasks
+ pipeline=...
+ )
+)
+```
+
+```{note}
+If you want to use this format, do not specify `ann_file`, or specify `ann_file=''`.
+
+Also note that the subfolder format requires scanning the folders, which may slow down initialization,
+especially for large datasets or slow file IO.
+```
+
+### Text Annotation File Format
+
+In this format, we use a text annotation file to store image file paths and the corresponding category
+indices.
+
+For supervised tasks (with `with_label=True`), the annotation file should include the file path and the
+category index of one sample per line, separated by a space, as below:
+
+All these file paths can be absolute paths, or paths relative to the `data_prefix`.
+
+```text
+folder_1/xxx.png 0
+folder_1/xxy.png 1
+123.png 4
+nsdf3.png 3
+...
+```
+
+```{note}
+The index numbers of categories start from 0. And the value of ground-truth labels should fall in range `[0, num_classes - 1]`.
+
+In addition, please use the `classes` field in the dataset settings to specify the name of every category.
+```
+
+For unsupervised tasks (with `with_label=False`), the annotation file only needs to include the file path of
+one sample per line, as below:
+
+```text
+folder_1/xxx.png
+folder_1/xxy.png
+123.png
+nsdf3.png
+...
+```
+
+Assume the entire dataset folder is as below:
+
+```text
+data_root
+├── meta
+│ ├── test.txt # The annotation file for the test dataset
+│ ├── train.txt # The annotation file for the training dataset
+│ └── val.txt # The annotation file for the validation dataset.
+├── train
+│ ├── 123.png
+│ ├── folder_1
+│ │ ├── xxx.png
+│ │ └── xxy.png
+│ └── nsdf3.png
+├── test
+└── val
+```
+
+Here is an example of the dataset settings in a config file:
+
+```python
+# Training dataloader configurations
+train_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root='path/to/data_root', # The common prefix of both `ann_file` and `data_prefix`.
+ ann_file='meta/train.txt', # The path of annotation file relative to the data_root.
+ data_prefix='train', # The prefix of file paths in the `ann_file`, relative to the data_root.
+ with_label=True, # or False for unsupervised tasks
+ classes=['A', 'B', 'C', 'D', ...], # The name of every category.
+ pipeline=..., # The transformations to process the dataset samples.
+ )
+ ...
+)
+```
+
+```{note}
+For a complete example about how to use the `CustomDataset`, please see [How to Pretrain with Custom Dataset](../notes/pretrain_custom_dataset.md)
+```
+
+## ImageNet
+
+ImageNet has multiple versions, but the most commonly used one is [ILSVRC 2012](http://www.image-net.org/challenges/LSVRC/2012/). It can be accessed with the following steps.
+
+`````{tabs}
+
+````{group-tab} Download by MIM
+
+MIM supports downloading from [OpenXlab](https://openxlab.org.cn/datasets) and preprocessing ImageNet dataset with one command line.
+
+_You need to register an account at the [OpenXlab official website](https://openxlab.org.cn/datasets) and log in via the CLI._
+
+```Bash
+# install OpenXlab CLI tools
+pip install -U openxlab
+# log in OpenXLab
+openxlab login
+# download and preprocess by MIM, better to execute in $MMPreTrain directory.
+mim download mmpretrain --dataset imagenet1k
+```
+
+````
+
+````{group-tab} Download from Official Source
+
+1. Register an account and log in to the [download page](http://www.image-net.org/download-images).
+2. Find download links for ILSVRC2012 and download the following two files
+ - ILSVRC2012_img_train.tar (~138GB)
+ - ILSVRC2012_img_val.tar (~6.3GB)
+3. Untar the downloaded files
+
+````
+
+`````
+
+### The Directory Structure of the ImageNet dataset
+
+We support two ways of organizing the ImageNet dataset: Subfolder Format and Text Annotation File Format.
+
+#### Subfolder Format
+
+We have provided a sample, which you can download and extract from this [link](https://download.openmmlab.com/mmpretrain/datasets/imagenet_1k.zip). The directory structure of the dataset should be as below:
+
+```text
+data/imagenet/
+├── train/
+│ ├── n01440764
+│ │ ├── n01440764_10026.JPEG
+│ │ ├── n01440764_10027.JPEG
+│ │ ├── n01440764_10029.JPEG
+│ │ ├── n01440764_10040.JPEG
+│ │ ├── n01440764_10042.JPEG
+│ │ ├── n01440764_10043.JPEG
+│ │ └── n01440764_10048.JPEG
+│ ├── ...
+├── val/
+│ ├── n01440764
+│ │ ├── ILSVRC2012_val_00000293.JPEG
+│ │ ├── ILSVRC2012_val_00002138.JPEG
+│ │ ├── ILSVRC2012_val_00003014.JPEG
+│ │ └── ...
+│ ├── ...
+```
+
+#### Text Annotation File Format
+
+You can download and untar the meta data from this [link](https://download.openmmlab.com/mmclassification/datasets/imagenet/meta/caffe_ilsvrc12.tar.gz). And re-organize the dataset as below:
+
+```text
+data/imagenet/
+├── meta/
+│ ├── train.txt
+│ ├── test.txt
+│ └── val.txt
+├── train/
+│ ├── n01440764
+│ │ ├── n01440764_10026.JPEG
+│ │ ├── n01440764_10027.JPEG
+│ │ ├── n01440764_10029.JPEG
+│ │ ├── n01440764_10040.JPEG
+│ │ ├── n01440764_10042.JPEG
+│ │ ├── n01440764_10043.JPEG
+│ │ └── n01440764_10048.JPEG
+│ ├── ...
+├── val/
+│ ├── ILSVRC2012_val_00000001.JPEG
+│ ├── ILSVRC2012_val_00000002.JPEG
+│ ├── ILSVRC2012_val_00000003.JPEG
+│ ├── ILSVRC2012_val_00000004.JPEG
+│ ├── ...
+```
+
+### Configuration
+
+Once your dataset is organized in the way described above, you can use the [`ImageNet`](mmpretrain.datasets.ImageNet) dataset with the below configurations:
+
+```python
+train_dataloader = dict(
+ ...
+ # Training dataset configurations
+ dataset=dict(
+ type='ImageNet',
+ data_root='data/imagenet',
+ split='train',
+ pipeline=...,
+ )
+)
+
+val_dataloader = dict(
+ ...
+ # Validation dataset configurations
+ dataset=dict(
+ type='ImageNet',
+ data_root='data/imagenet',
+ split='val',
+ pipeline=...,
+ )
+)
+
+test_dataloader = val_dataloader
+```
+
+## Supported Image Classification Datasets
+
+| Datasets | split | HomePage |
+| ---------------------------------------------------------------------------------- | :---------------------------------- | ----------------------------------------------------------------------------------- |
+| [`Caltech101`](mmpretrain.datasets.Caltech101)(data_root[, split, pipeline, ...]) | ["train", "test"] | [Caltech 101](https://data.caltech.edu/records/mzrjq-6wc02) Dataset. |
+| [`CIFAR10`](mmpretrain.datasets.CIFAR10)(data_root[, split, pipeline, ...]) | ["train", "test"] | [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) Dataset. |
+| [`CIFAR100`](mmpretrain.datasets.CIFAR100)(data_root[, split, pipeline, ...]) | ["train", "test"] | [CIFAR100](https://www.cs.toronto.edu/~kriz/cifar.html) Dataset. |
+| [`CUB`](mmpretrain.datasets.CUB)(data_root[, split, pipeline, ...]) | ["train", "test"] | [CUB-200-2011](http://www.vision.caltech.edu/datasets/cub_200_2011/) Dataset. |
+| [`DTD`](mmpretrain.datasets.DTD)(data_root[, split, pipeline, ...]) | ["train", "val", "trainval", "test"] | [Describable Texture Dataset (DTD)](https://www.robots.ox.ac.uk/~vgg/data/dtd/) Dataset. |
+| [`FashionMNIST`](mmpretrain.datasets.FashionMNIST) (data_root[, split, pipeline, ...]) | ["train", "test"] | [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist) Dataset. |
+| [`FGVCAircraft`](mmpretrain.datasets.FGVCAircraft)(data_root[, split, pipeline, ...]) | ["train", "val", "trainval", "test"] | [FGVC Aircraft](https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/) Dataset. |
+| [`Flowers102`](mmpretrain.datasets.Flowers102)(data_root[, split, pipeline, ...]) | ["train", "val", "trainval", "test"] | [Oxford 102 Flower](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/) Dataset. |
+| [`Food101`](mmpretrain.datasets.Food101)(data_root[, split, pipeline, ...]) | ["train", "test"] | [Food101](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/) Dataset. |
+| [`MNIST`](mmpretrain.datasets.MNIST) (data_root[, split, pipeline, ...]) | ["train", "test"] | [MNIST](http://yann.lecun.com/exdb/mnist/) Dataset. |
+| [`OxfordIIITPet`](mmpretrain.datasets.OxfordIIITPet)(data_root[, split, pipeline, ...]) | ["trainval", "test"] | [Oxford-IIIT Pets](https://www.robots.ox.ac.uk/~vgg/data/pets/) Dataset. |
+| [`Places205`](mmpretrain.datasets.Places205)(data_root[, pipeline, ...]) | - | [Places205](http://places.csail.mit.edu/downloadData.html) Dataset. |
+| [`StanfordCars`](mmpretrain.datasets.StanfordCars)(data_root[, split, pipeline, ...]) | ["train", "test"] | [Stanford Cars](https://ai.stanford.edu/~jkrause/cars/car_dataset.html) Dataset. |
+| [`SUN397`](mmpretrain.datasets.SUN397)(data_root[, split, pipeline, ...]) | ["train", "test"] | [SUN397](https://vision.princeton.edu/projects/2010/SUN/) Dataset. |
+| [`VOC`](mmpretrain.datasets.VOC)(data_root[, image_set_path, pipeline, ...]) | ["train", "val", "trainval", "test"] | [Pascal VOC](http://host.robots.ox.ac.uk/pascal/VOC/) Dataset. |
+
+Some dataset homepage links may be unavailable, and you can download datasets through [OpenXLab](https://openxlab.org.cn/datasets), such as [Stanford Cars](https://openxlab.org.cn/datasets/OpenDataLab/Stanford_Cars).
+
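+As a quick sketch of how the `split` argument from the table is used (the paths and pipeline are
+illustrative), the config below builds a CIFAR10 training dataloader:
+
+```python
+train_dataloader = dict(
+    batch_size=16,       # illustrative value
+    num_workers=2,       # illustrative value
+    sampler=dict(type='DefaultSampler', shuffle=True),
+    dataset=dict(
+        type='CIFAR10',
+        data_root='data/cifar10',            # root directory of the CIFAR10 data
+        split='train',                       # one of the splits listed in the table
+        pipeline=[dict(type='PackInputs')],  # minimal pipeline, just for illustration
+    ),
+)
+```
+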
+## Supported Multi-modality Datasets
+
+| Datasets | split | HomePage |
+| --------------------------------------------------------------------------------------------- | :----------------------- | ----------------------------------------------------------------------------------- |
+| [`RefCOCO`](mmpretrain.datasets.RefCOCO)(data_root, ann_file, data_prefix, split_file[, split, ...]) | ["train", "val", "test"] | [RefCOCO](https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip) Dataset. |
+
+Some dataset homepage links may be unavailable, and you can download datasets through [OpenDataLab](https://opendatalab.com/), such as [RefCOCO](https://opendatalab.com/RefCOCO/download).
+
+## OpenMMLab 2.0 Standard Dataset
+
+In order to facilitate the training of multi-task algorithm models, we unify the dataset interfaces of different tasks. OpenMMLab has formulated the **OpenMMLab 2.0 Dataset Format Specification**. When starting a training task, users can choose to convert their dataset annotations into the specified format, and use the algorithm libraries of OpenMMLab to perform training and testing based on the data annotation files.
+
+The OpenMMLab 2.0 Dataset Format Specification stipulates that the annotation file must be in `json`, `yaml`/`yml` or `pickle`/`pkl` format. The dictionary stored in the annotation file must contain the `metainfo` and `data_list` fields. The value of `metainfo` is a dictionary containing the meta information of the dataset; the value of `data_list` is a list, and each element in the list is a dictionary that defines one raw data item, which contains one or several training/testing samples.
+
+The following is an example of a JSON annotation file (in this example each raw data contains only one train/test sample):
+
+```
+{
+ 'metainfo':
+ {
+ 'classes': ('cat', 'dog'), # the category index of 'cat' is 0 and 'dog' is 1.
+ ...
+ },
+ 'data_list':
+ [
+ {
+ 'img_path': "xxx/xxx_0.jpg",
+ 'gt_label': 0,
+ ...
+ },
+ {
+ 'img_path': "xxx/xxx_1.jpg",
+ 'gt_label': 1,
+ ...
+ },
+ ...
+ ]
+}
+```
+
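+Such an annotation file can be produced by a small conversion script. Here is a minimal sketch (the file
+names, labels and output path are illustrative):
+
+```python
+import json
+
+ann = {
+    'metainfo': {'classes': ('cat', 'dog')},  # category 0 is 'cat', category 1 is 'dog'
+    'data_list': [
+        {'img_path': 'xxx/xxx_0.jpg', 'gt_label': 0},
+        {'img_path': 'xxx/xxx_1.jpg', 'gt_label': 1},
+    ],
+}
+
+with open('data/annotations/train.json', 'w') as f:
+    json.dump(ann, f)
+```
+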
+Assume you want to use the training dataset and the dataset is stored as the below structure:
+
+```text
+data
+├── annotations
+│ ├── train.json
+├── train
+│ ├── xxx/xxx_0.jpg
+│ ├── xxx/xxx_1.jpg
+│ ├── ...
+```
+
+You can then build the training dataloader from the following configuration:
+
+```python
+train_dataloader = dict(
+ ...
+ dataset=dict(
+ type='BaseDataset',
+ data_root='data',
+ ann_file='annotations/train.json',
+ data_prefix='train/',
+ pipeline=...,
+ )
+)
+```
+
+## Other Datasets
+
+To find more datasets supported by MMPretrain, and get more configurations of the above datasets, please see the [dataset documentation](mmpretrain.datasets).
+
+To implement your own dataset class for some special formats, please see the [Adding New Dataset](../advanced_guides/datasets.md).
+
+## Dataset Wrappers
+
+The following dataset wrappers are supported in MMEngine; you can refer to the {external+mmengine:doc}`MMEngine tutorial ` to learn how to use them.
+
+- {external:py:class}`~mmengine.dataset.ConcatDataset`
+- {external:py:class}`~mmengine.dataset.RepeatDataset`
+- {external:py:class}`~mmengine.dataset.ClassBalancedDataset`
+
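+For example, here is a sketch (the values are illustrative) of wrapping a dataset with `RepeatDataset`
+inside a dataloader config:
+
+```python
+train_dataloader = dict(
+    batch_size=32,
+    num_workers=4,
+    sampler=dict(type='DefaultSampler', shuffle=True),
+    dataset=dict(
+        type='RepeatDataset',
+        times=3,  # repeat the wrapped dataset 3 times in every epoch
+        dataset=dict(
+            type='ImageNet',
+            data_root='data/imagenet',
+            split='train',
+            pipeline=[dict(type='LoadImageFromFile'), dict(type='PackInputs')],
+        ),
+    ),
+)
+```
+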
+MMPretrain also supports [`KFoldDataset`](mmpretrain.datasets.KFoldDataset); please use it with `tools/kfold-cross-valid.py`.
diff --git a/docs/en/user_guides/downstream.md b/docs/en/user_guides/downstream.md
new file mode 100644
index 0000000000000000000000000000000000000000..9abb077ae9b98b25054441a618d14b34406c2d2c
--- /dev/null
+++ b/docs/en/user_guides/downstream.md
@@ -0,0 +1,128 @@
+# Downstream tasks
+
+## Detection
+
+For detection tasks, please use MMDetection. First, make sure you have installed [MIM](https://github.com/open-mmlab/mim), which is also a project of OpenMMLab.
+
+```shell
+pip install openmim
+mim install 'mmdet>=3.0.0rc0'
+```
+
+Besides, please refer to MMDet for [installation](https://mmdetection.readthedocs.io/en/dev-3.x/get_started.html) and [data preparation](https://mmdetection.readthedocs.io/en/dev-3.x/user_guides/dataset_prepare.html).
+
+### Train
+
+After installation, you can run MMDetection with a simple command.
+
+```shell
+# distributed version
+bash tools/benchmarks/mmdetection/mim_dist_train_c4.sh ${CONFIG} ${PRETRAIN} ${GPUS}
+bash tools/benchmarks/mmdetection/mim_dist_train_fpn.sh ${CONFIG} ${PRETRAIN} ${GPUS}
+
+# slurm version
+bash tools/benchmarks/mmdetection/mim_slurm_train_c4.sh ${PARTITION} ${CONFIG} ${PRETRAIN}
+bash tools/benchmarks/mmdetection/mim_slurm_train_fpn.sh ${PARTITION} ${CONFIG} ${PRETRAIN}
+```
+
+- `${CONFIG}`: Use the config file path in MMDetection directly. For some algorithms, we also have some
+ modified config files which can be found in the `benchmarks` folder under the corresponding algorithm
+ folder. You can also write your config file from scratch.
+- `${PRETRAIN}`: the pre-trained model file.
+- `${GPUS}`: The number of GPUs that you want to use to train. We adopt 8 GPUs for detection tasks by default.
+
+Example:
+
+```shell
+bash ./tools/benchmarks/mmdetection/mim_dist_train_c4.sh \
+ configs/byol/benchmarks/mask-rcnn_r50-c4_ms-1x_coco.py \
+ https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth 8
+```
+
+### Test
+
+After training, you can also run the command below to test your model.
+
+```shell
+# distributed version
+bash tools/benchmarks/mmdetection/mim_dist_test.sh ${CONFIG} ${CHECKPOINT} ${GPUS}
+
+# slurm version
+bash tools/benchmarks/mmdetection/mim_slurm_test.sh ${PARTITION} ${CONFIG} ${CHECKPOINT}
+```
+
+- `${CONFIG}`: Use the config file name in MMDetection directly. For some algorithms, we also have some
+ modified config files which can be found in the `benchmarks` folder under the corresponding algorithm
+ folder. You can also write your config file from scratch.
+- `${CHECKPOINT}`: The fine-tuned detection model that you want to test.
+- `${GPUS}`: The number of GPUs that you want to use to test. We adopt 8 GPUs for detection tasks by default.
+
+Example:
+
+```shell
+bash ./tools/benchmarks/mmdetection/mim_dist_test.sh \
+configs/byol/benchmarks/mask-rcnn_r50_fpn_ms-1x_coco.py \
+https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth 8
+```
+
+## Segmentation
+
+For the semantic segmentation task, we use MMSegmentation. First, make sure you have installed [MIM](https://github.com/open-mmlab/mim), which is also a project of OpenMMLab.
+
+```shell
+pip install openmim
+mim install 'mmsegmentation>=1.0.0rc0'
+```
+
+Besides, please refer to MMSegmentation for [installation](https://mmsegmentation.readthedocs.io/en/dev-1.x/get_started.html) and [data preparation](https://mmsegmentation.readthedocs.io/en/dev-1.x/user_guides/2_dataset_prepare.html).
+
+### Train
+
+After installation, you can run MMSegmentation with a simple command.
+
+```shell
+# distributed version
+bash tools/benchmarks/mmsegmentation/mim_dist_train.sh ${CONFIG} ${PRETRAIN} ${GPUS}
+
+# slurm version
+bash tools/benchmarks/mmsegmentation/mim_slurm_train.sh ${PARTITION} ${CONFIG} ${PRETRAIN}
+```
+
+- `${CONFIG}`: Use the config file path in MMSegmentation directly. For some algorithms, we also have some
+ modified config files which can be found in the `benchmarks` folder under the corresponding algorithm
+ folder. You can also write your config file from scratch.
+- `${PRETRAIN}`: the pre-trained model file.
+- `${GPUS}`: The number of GPUs that you want to use to train. We adopt 4 GPUs for segmentation tasks by default.
+
+Example:
+
+```shell
+bash ./tools/benchmarks/mmsegmentation/mim_dist_train.sh \
+configs/benchmarks/mmsegmentation/voc12aug/fcn_r50-d8_4xb4-20k_voc12aug-512x512.py \
+https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth 4
+```
+
+### Test
+
+After training, you can also run the command below to test your model.
+
+```shell
+# distributed version
+bash tools/benchmarks/mmsegmentation/mim_dist_test.sh ${CONFIG} ${CHECKPOINT} ${GPUS}
+
+# slurm version
+bash tools/benchmarks/mmsegmentation/mim_slurm_test.sh ${PARTITION} ${CONFIG} ${CHECKPOINT}
+```
+
+- `${CONFIG}`: Use the config file name in MMSegmentation directly. For some algorithms, we also have some
+ modified config files which can be found in the `benchmarks` folder under the corresponding algorithm
+ folder. You can also write your config file from scratch.
+- `${CHECKPOINT}`: The fine-tuned segmentation model that you want to test.
+- `${GPUS}`: The number of GPUs that you want to use to test. We adopt 4 GPUs for segmentation tasks by default.
+
+Example:
+
+```shell
+bash ./tools/benchmarks/mmsegmentation/mim_dist_test.sh fcn_r50-d8_4xb4-20k_voc12aug-512x512.py \
+https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth 4
+```
diff --git a/docs/en/user_guides/inference.md b/docs/en/user_guides/inference.md
new file mode 100644
index 0000000000000000000000000000000000000000..8d6cbefb67d9e7790627d566fca1a89cfd9bcfe2
--- /dev/null
+++ b/docs/en/user_guides/inference.md
@@ -0,0 +1,179 @@
+# Inference with existing models
+
+This tutorial will show how to use the following APIs:
+
+- [**`list_models`**](mmpretrain.apis.list_models): List available model names in MMPreTrain.
+- [**`get_model`**](mmpretrain.apis.get_model): Get a model from model name or model config.
+- [**`inference_model`**](mmpretrain.apis.inference_model): Run inference with a model using the corresponding
+ inferencer. It's a shortcut for a quick start; for advanced usage, please use the inferencers below
+ directly.
+- Inferencers:
+ 1. [**`ImageClassificationInferencer`**](mmpretrain.apis.ImageClassificationInferencer):
+ Perform image classification on the given image.
+ 2. [**`ImageRetrievalInferencer`**](mmpretrain.apis.ImageRetrievalInferencer):
+ Perform image-to-image retrieval from the given image on a given image set.
+ 3. [**`ImageCaptionInferencer`**](mmpretrain.apis.ImageCaptionInferencer):
+ Generate a caption on the given image.
+ 4. [**`VisualQuestionAnsweringInferencer`**](mmpretrain.apis.VisualQuestionAnsweringInferencer):
+ Answer a question according to the given image.
+ 5. [**`VisualGroundingInferencer`**](mmpretrain.apis.VisualGroundingInferencer):
+ Locate an object from the description on the given image.
+ 6. [**`TextToImageRetrievalInferencer`**](mmpretrain.apis.TextToImageRetrievalInferencer):
+ Perform text-to-image retrieval from the given description on a given image set.
+ 7. [**`ImageToTextRetrievalInferencer`**](mmpretrain.apis.ImageToTextRetrievalInferencer):
+ Perform image-to-text retrieval from the given image on a series of text.
+ 8. [**`NLVRInferencer`**](mmpretrain.apis.NLVRInferencer):
+ Perform Natural Language for Visual Reasoning on a given image-pair and text.
+ 9. [**`FeatureExtractor`**](mmpretrain.apis.FeatureExtractor):
+ Extract features from the image files by a vision backbone.
+
+## List available models
+
+List all the models in MMPreTrain.
+
+```python
+>>> from mmpretrain import list_models
+>>> list_models()
+['barlowtwins_resnet50_8xb256-coslr-300e_in1k',
+ 'beit-base-p16_beit-in21k-pre_3rdparty_in1k',
+ ...]
+```
+
+`list_models` supports Unix filename pattern matching; you can use `*` to match any characters.
+
+```python
+>>> from mmpretrain import list_models
+>>> list_models("*convnext-b*21k")
+['convnext-base_3rdparty_in21k',
+ 'convnext-base_in21k-pre-3rdparty_in1k-384px',
+ 'convnext-base_in21k-pre_3rdparty_in1k']
+```
+
+You can use the `list_models` method of inferencers to get the available models of the corresponding tasks.
+
+```python
+>>> from mmpretrain import ImageCaptionInferencer
+>>> ImageCaptionInferencer.list_models()
+['blip-base_3rdparty_caption',
+ 'blip2-opt2.7b_3rdparty-zeroshot_caption',
+ 'flamingo_3rdparty-zeroshot_caption',
+ 'ofa-base_3rdparty-finetuned_caption']
+```
+
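+Once you have picked a model name, the task-specific inferencers are called in the same way as the
+classification example later on this page. Here is a minimal sketch (the exact keys of the returned dict
+depend on the task):
+
+```python
+>>> from mmpretrain import ImageCaptionInferencer
+>>> inferencer = ImageCaptionInferencer('blip-base_3rdparty_caption')
+>>> result = inferencer('https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG')[0]
+>>> print(result)
+```
+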
+## Get a model
+
+You can use `get_model` to get a model.
+
+```python
+>>> from mmpretrain import get_model
+
+# Get model without loading pre-trained weight.
+>>> model = get_model("convnext-base_in21k-pre_3rdparty_in1k")
+
+# Get model and load the default checkpoint.
+>>> model = get_model("convnext-base_in21k-pre_3rdparty_in1k", pretrained=True)
+
+# Get model and load the specified checkpoint.
+>>> model = get_model("convnext-base_in21k-pre_3rdparty_in1k", pretrained="your_local_checkpoint_path")
+
+# Get model with extra initialization arguments, for example, modify the num_classes in head.
+>>> model = get_model("convnext-base_in21k-pre_3rdparty_in1k", head=dict(num_classes=10))
+
+# Another example, remove the neck and head, and output from stage 1, 2, 3 in backbone
+>>> model_headless = get_model("resnet18_8xb32_in1k", head=None, neck=None, backbone=dict(out_indices=(1, 2, 3)))
+```
+
+The obtained model is a usual PyTorch module.
+
+```python
+>>> import torch
+>>> from mmpretrain import get_model
+>>> model = get_model('convnext-base_in21k-pre_3rdparty_in1k', pretrained=True)
+>>> x = torch.rand((1, 3, 224, 224))
+>>> y = model(x)
+>>> print(type(y), y.shape)
+<class 'torch.Tensor'> torch.Size([1, 1000])
+```
+
+## Inference on given images
+
+Here is an example of performing inference on an [image](https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG) with the pre-trained ResNet-50 classification model.
+
+```python
+>>> from mmpretrain import inference_model
+>>> image = 'https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG'
+>>> # If you have no graphical interface, please set `show=False`
+>>> result = inference_model('resnet50_8xb32_in1k', image, show=True)
+>>> print(result['pred_class'])
+sea snake
+```
+
+The `inference_model` API is only for demonstration and cannot keep the model instance or run inference on multiple
+samples. You can use the inferencers for multiple calls.
+
+```python
+>>> from mmpretrain import ImageClassificationInferencer
+>>> image = 'https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG'
+>>> inferencer = ImageClassificationInferencer('resnet50_8xb32_in1k')
+>>> # Note that the inferencer output is a list of results even if the input is a single sample.
+>>> result = inferencer('https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG')[0]
+>>> print(result['pred_class'])
+sea snake
+>>>
+>>> # You can also use it for multiple images.
+>>> image_list = ['demo/demo.JPEG', 'demo/bird.JPEG'] * 16
+>>> results = inferencer(image_list, batch_size=8)
+>>> print(len(results))
+32
+>>> print(results[1]['pred_class'])
+house finch, linnet, Carpodacus mexicanus
+```
+
+Usually, the result for every sample is a dictionary. For example, the image classification result is a dictionary containing `pred_label`, `pred_score`, `pred_scores` and `pred_class` as follows:
+
+```python
+{
+ "pred_label": 65,
+ "pred_score": 0.6649366617202759,
+ "pred_class":"sea snake",
+ "pred_scores": array([..., 0.6649366617202759, ...], dtype=float32)
+}
+```
+
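+If you need more than the single best prediction, here is a small sketch (not part of the API) of ranking
+the returned `pred_scores` array from the result dict above to get the top-5 class indices:
+
+```python
+>>> import numpy as np
+>>> top5 = np.argsort(result['pred_scores'])[::-1][:5]
+>>> print(top5)
+>>> print(result['pred_scores'][top5])
+```
+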
+You can configure the inferencer by arguments, for example, use your own config file and checkpoint to
+run inference on images with CUDA.
+
+```python
+>>> from mmpretrain import ImageClassificationInferencer
+>>> image = 'https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG'
+>>> config = 'configs/resnet/resnet50_8xb32_in1k.py'
+>>> checkpoint = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth'
+>>> inferencer = ImageClassificationInferencer(model=config, pretrained=checkpoint, device='cuda')
+>>> result = inferencer(image)[0]
+>>> print(result['pred_class'])
+sea snake
+```
+
+## Inference by a Gradio demo
+
+We also provide a gradio demo for all supported tasks and you can find it in [projects/gradio_demo/launch.py](https://github.com/open-mmlab/mmpretrain/blob/main/projects/gradio_demo/launch.py).
+
+Please install `gradio` with `pip install -U gradio` first.
+
+Here is the interface preview:
+
+*(Gradio demo interface preview)*
+
+## Extract Features From Image
+
+Compared with `model.extract_feat`, `FeatureExtractor` is used to extract features directly from image files, instead of from a batch of tensors.
+In short, the input of `model.extract_feat` is a `torch.Tensor`, while the input of `FeatureExtractor` is image files.
+
+```python
+>>> from mmpretrain import FeatureExtractor, get_model
+>>> model = get_model('resnet50_8xb32_in1k', backbone=dict(out_indices=(0, 1, 2, 3)))
+>>> extractor = FeatureExtractor(model)
+>>> features = extractor('https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG')[0]
+>>> features[0].shape, features[1].shape, features[2].shape, features[3].shape
+(torch.Size([256]), torch.Size([512]), torch.Size([1024]), torch.Size([2048]))
+```
diff --git a/docs/en/user_guides/test.md b/docs/en/user_guides/test.md
new file mode 100644
index 0000000000000000000000000000000000000000..65ec073ea96762a0e5c6c850b7bdbd3fd3e67dac
--- /dev/null
+++ b/docs/en/user_guides/test.md
@@ -0,0 +1,123 @@
+# Test
+
+For image classification and image retrieval tasks, you can test your model after training.
+
+## Test with your PC
+
+You can use `tools/test.py` to test a model on a single machine with a CPU and optionally a GPU.
+
+Here is the full usage of the script:
+
+```shell
+python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]
+```
+
+````{note}
+By default, MMPretrain prefers GPU to CPU. If you want to test a model on CPU, please empty `CUDA_VISIBLE_DEVICES` or set it to -1 to make GPU invisible to the program.
+
+```bash
+CUDA_VISIBLE_DEVICES=-1 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]
+```
+````
+
+| ARGS | Description |
+| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `CONFIG_FILE` | The path to the config file. |
+| `CHECKPOINT_FILE` | The path to the checkpoint file (It can be a http link, and you can find checkpoints [here](https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html)). |
+| `--work-dir WORK_DIR` | The directory to save the file containing evaluation metrics. |
+| `--out OUT` | The path to save the file containing test results. |
+| `--out-item OUT_ITEM` | To specify the content of the test results file, and it can be "pred" or "metrics". If "pred", save the outputs of the model for offline evaluation. If "metrics", save the evaluation metrics. Defaults to "pred". |
+| `--cfg-options CFG_OPTIONS` | Override some settings in the used config, the key-value pair in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be of the form of either `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no white space is allowed. |
+| `--show-dir SHOW_DIR` | The directory to save the result visualization images. |
+| `--show` | Visualize the prediction result in a window. |
+| `--interval INTERVAL` | The interval of samples to visualize. |
+| `--wait-time WAIT_TIME` | The display time of every window (in seconds). Defaults to 1. |
+| `--no-pin-memory` | Whether to disable the `pin_memory` option in dataloaders. |
+| `--tta` | Whether to enable the Test-Time-Aug (TTA). If the config file has `tta_pipeline` and `tta_model` fields, use them to determine the TTA transforms and how to merge the TTA results. Otherwise, use flip TTA by averaging classification score. |
+| `--launcher {none,pytorch,slurm,mpi}` | Options for job launcher. |
+
+## Test with multiple GPUs
+
+We provide a shell script to start a multi-GPU task with `torch.distributed.launch`.
+
+```shell
+bash ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]
+```
+
+| ARGS | Description |
+| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `CONFIG_FILE` | The path to the config file. |
+| `CHECKPOINT_FILE` | The path to the checkpoint file (It can be a http link, and you can find checkpoints [here](https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html)). |
+| `GPU_NUM` | The number of GPUs to be used. |
+| `[PY_ARGS]` | The other optional arguments of `tools/test.py`, see [here](#test-with-your-pc). |
+
+You can also specify extra arguments of the launcher by environment variables. For example, change the
+communication port of the launcher to 29666 by the below command:
+
+```shell
+PORT=29666 bash ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]
+```
+
+If you want to start up multiple test jobs and use different GPUs, you can launch them by specifying
+different ports and visible devices.
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash ./tools/dist_test.sh ${CONFIG_FILE1} ${CHECKPOINT_FILE} 4 [PY_ARGS]
+CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash ./tools/dist_test.sh ${CONFIG_FILE2} ${CHECKPOINT_FILE} 4 [PY_ARGS]
+```
+
+## Test with multiple machines
+
+### Multiple machines in the same network
+
+If you launch a test job with multiple machines connected with ethernet, you can run the following commands:
+
+On the first machine:
+
+```shell
+NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT_FILE $GPUS
+```
+
+On the second machine:
+
+```shell
+NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT_FILE $GPUS
+```
+
+Compared with using multiple GPUs on a single machine, you need to specify some extra environment variables:
+
+| ENV_VARS | Description |
+| ------------- | ---------------------------------------------------------------------------- |
+| `NNODES` | The total number of machines. |
+| `NODE_RANK` | The index of the local machine. |
+| `PORT` | The communication port, it should be the same in all machines. |
+| `MASTER_ADDR` | The IP address of the master machine, it should be the same in all machines. |
+
+It is usually slow if you do not have high-speed networking like InfiniBand.
+
+### Multiple machines managed with slurm
+
+If you run MMPretrain on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `tools/slurm_test.sh`.
+
+```shell
+[ENV_VARS] ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} [PY_ARGS]
+```
+
+Here are the descriptions of the script arguments.
+
+| ARGS | Description |
+| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `PARTITION` | The partition to use in your cluster. |
+| `JOB_NAME` | The name of your job, you can name it as you like. |
+| `CONFIG_FILE` | The path to the config file. |
+| `CHECKPOINT_FILE` | The path to the checkpoint file (It can be a http link, and you can find checkpoints [here](https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html)). |
+| `[PY_ARGS]` | The other optional arguments of `tools/test.py`, see [here](#test-with-your-pc). |
+
+Here are the environment variables that can be used to configure the slurm job.
+
+| ENV_VARS | Description |
+| --------------- | ---------------------------------------------------------------------------------------------------------- |
+| `GPUS` | The number of GPUs to be used. Defaults to 8. |
+| `GPUS_PER_NODE` | The number of GPUs to be allocated per node. |
+| `CPUS_PER_TASK` | The number of CPUs to be allocated per task (Usually one GPU corresponds to one task). Defaults to 5. |
+| `SRUN_ARGS` | The other arguments of `srun`. Available options can be found [here](https://slurm.schedmd.com/srun.html). |
diff --git a/docs/en/user_guides/train.md b/docs/en/user_guides/train.md
new file mode 100644
index 0000000000000000000000000000000000000000..9cc618b038b4c44e46904ccca5c80731653ab1fc
--- /dev/null
+++ b/docs/en/user_guides/train.md
@@ -0,0 +1,121 @@
+# Train
+
+In this tutorial, we will introduce how to use the scripts provided in MMPretrain to start a training task. If
+needed, we also have some practical examples about [how to pretrain with a custom dataset](../notes/pretrain_custom_dataset.md)
+and [how to finetune with a custom dataset](../notes/finetune_custom_dataset.md).
+
+## Train with your PC
+
+You can use `tools/train.py` to train a model on a single machine with a CPU and optionally a GPU.
+
+Here is the full usage of the script:
+
+```shell
+python tools/train.py ${CONFIG_FILE} [ARGS]
+```
+
+````{note}
+By default, MMPretrain prefers GPU to CPU. If you want to train a model on CPU, please empty `CUDA_VISIBLE_DEVICES` or set it to -1 to make GPU invisible to the program.
+
+```bash
+CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE} [ARGS]
+```
+````
+
+| ARGS | Description |
+| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `CONFIG_FILE` | The path to the config file. |
+| `--work-dir WORK_DIR` | The target folder to save logs and checkpoints. Defaults to a folder with the same name of the config file under `./work_dirs`. |
+| `--resume [RESUME]`                   | Resume training. If a path is specified, resume from it; otherwise, try to auto-resume from the latest checkpoint in the work directory.                             |
+| `--amp` | Enable automatic-mixed-precision training. |
+| `--no-validate` | **Not suggested**. Disable checkpoint evaluation during training. |
+| `--auto-scale-lr` | Auto scale the learning rate according to the actual batch size and the original batch size. |
+| `--no-pin-memory` | Whether to disable the `pin_memory` option in dataloaders. |
+| `--no-persistent-workers` | Whether to disable the `persistent_workers` option in dataloaders. |
+| `--cfg-options CFG_OPTIONS` | Override some settings in the used config, the key-value pair in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be of the form of either `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no white space is allowed. |
+| `--launcher {none,pytorch,slurm,mpi}` | Options for job launcher. |
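+
+For example, you can train a ResNet-50 on ImageNet with automatic mixed precision and a custom work directory as below (the paths here are only illustrative):
+
+```shell
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py \
+    --work-dir ./work_dirs/resnet50_amp \
+    --amp \
+    --cfg-options train_dataloader.batch_size=64
+```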
+
+## Train with multiple GPUs
+
+We provide a shell script to start a multi-GPU task with `torch.distributed.launch`.
+
+```shell
+bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]
+```
+
+| ARGS | Description |
+| ------------- | ---------------------------------------------------------------------------------- |
+| `CONFIG_FILE` | The path to the config file. |
+| `GPU_NUM` | The number of GPUs to be used. |
+| `[PY_ARGS]` | The other optional arguments of `tools/train.py`, see [here](#train-with-your-pc). |
+
+You can also specify extra arguments of the launcher by environment variables. For example, change the
+communication port of the launcher to 29666 with the command below:
+
+```shell
+PORT=29666 bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]
+```
+
+If you want to start multiple training jobs on different GPUs, you can launch them by specifying
+different ports and visible devices.
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash ./tools/dist_train.sh ${CONFIG_FILE1} 4 [PY_ARGS]
+CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash ./tools/dist_train.sh ${CONFIG_FILE2} 4 [PY_ARGS]
+```
+
+## Train with multiple machines
+
+### Multiple machines in the same network
+
+If you launch a training job with multiple machines connected via Ethernet, you can run the following commands:
+
+On the first machine:
+
+```shell
+NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS
+```
+
+On the second machine:
+
+```shell
+NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS
+```
+
+Compared with multi-GPU training on a single machine, you need to specify some extra environment variables:
+
+| ENV_VARS | Description |
+| ------------- | ---------------------------------------------------------------------------- |
+| `NNODES` | The total number of machines. |
+| `NODE_RANK` | The index of the local machine. |
+| `PORT`        | The communication port; it should be the same on all machines.               |
+| `MASTER_ADDR` | The IP address of the master machine; it should be the same on all machines. |
+
+Training across multiple machines is usually slow if you do not have high-speed networking such as InfiniBand.
+
+### Multiple machines managed with slurm
+
+If you run MMPretrain on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `tools/slurm_train.sh`.
+
+```shell
+[ENV_VARS] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS]
+```
+
+Here is a description of the script's arguments.
+
+| ARGS | Description |
+| ------------- | ---------------------------------------------------------------------------------- |
+| `PARTITION` | The partition to use in your cluster. |
+| `JOB_NAME` | The name of your job, you can name it as you like. |
+| `CONFIG_FILE` | The path to the config file. |
+| `WORK_DIR` | The target folder to save logs and checkpoints. |
+| `[PY_ARGS]` | The other optional arguments of `tools/train.py`, see [here](#train-with-your-pc). |
+
+Here are the environment variables that can be used to configure the slurm job.
+
+| ENV_VARS | Description |
+| --------------- | ---------------------------------------------------------------------------------------------------------- |
+| `GPUS` | The number of GPUs to be used. Defaults to 8. |
+| `GPUS_PER_NODE` | The number of GPUs to be allocated per node.                                                                 |
+| `CPUS_PER_TASK` | The number of CPUs to be allocated per task (Usually one GPU corresponds to one task). Defaults to 5. |
+| `SRUN_ARGS` | The other arguments of `srun`. Available options can be found [here](https://slurm.schedmd.com/srun.html). |
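+
+For example, the following command submits a 16-GPU training job (the partition and job names are only placeholders):
+
+```shell
+GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=5 ./tools/slurm_train.sh my-partition train-resnet configs/resnet/resnet50_8xb32_in1k.py ./work_dirs/resnet50
+```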
diff --git a/docs/zh_CN/Makefile b/docs/zh_CN/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..d4bb2cbb9eddb1bb1b4f366623044af8e4830919
--- /dev/null
+++ b/docs/zh_CN/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS ?=
+SPHINXBUILD ?= sphinx-build
+SOURCEDIR = .
+BUILDDIR = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+ @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+ @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/zh_CN/_static/css/readthedocs.css b/docs/zh_CN/_static/css/readthedocs.css
new file mode 100644
index 0000000000000000000000000000000000000000..39dc689e8a97b22a48e9d6badbb729faa4335d3c
--- /dev/null
+++ b/docs/zh_CN/_static/css/readthedocs.css
@@ -0,0 +1,61 @@
+.header-logo {
+ background-image: url("../image/mmpt-logo.png");
+ background-size: 183px 50px;
+ height: 50px;
+ width: 183px;
+}
+
+@media screen and (min-width: 1100px) {
+ .header-logo {
+ top: -12px;
+ }
+}
+
+pre {
+ white-space: pre;
+}
+
+@media screen and (min-width: 2000px) {
+ .pytorch-content-left {
+ width: 1200px;
+ margin-left: 30px;
+ }
+ article.pytorch-article {
+ max-width: 1200px;
+ }
+ .pytorch-breadcrumbs-wrapper {
+ width: 1200px;
+ }
+ .pytorch-right-menu.scrolling-fixed {
+ position: fixed;
+ top: 45px;
+ left: 1580px;
+ }
+}
+
+article.pytorch-article section code {
+ padding: .2em .4em;
+ background-color: #f3f4f7;
+ border-radius: 5px;
+}
+
+/* Disable the change in tables */
+article.pytorch-article section table code {
+ padding: unset;
+ background-color: unset;
+ border-radius: unset;
+}
+
+table.autosummary td {
+ width: 50%
+}
+
+img.align-center {
+ display: block;
+ margin-left: auto;
+ margin-right: auto;
+}
+
+article.pytorch-article p.rubric {
+ font-weight: bold;
+}
diff --git a/docs/zh_CN/_static/image/confusion-matrix.png b/docs/zh_CN/_static/image/confusion-matrix.png
new file mode 120000
index 0000000000000000000000000000000000000000..7b0b377272ca60968b14e3b30e5cb8545f13534b
--- /dev/null
+++ b/docs/zh_CN/_static/image/confusion-matrix.png
@@ -0,0 +1 @@
+../../../en/_static/image/confusion-matrix.png
\ No newline at end of file
diff --git a/docs/zh_CN/_static/image/mmpt-logo.png b/docs/zh_CN/_static/image/mmpt-logo.png
new file mode 100644
index 0000000000000000000000000000000000000000..f4e060716520ece5db7e85df3c3ad8fd9e0eda57
Binary files /dev/null and b/docs/zh_CN/_static/image/mmpt-logo.png differ
diff --git a/docs/zh_CN/_static/image/tools/analysis/analyze_log.jpg b/docs/zh_CN/_static/image/tools/analysis/analyze_log.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..8eb1a27d6464d255b84b23a7460a5f622f51712f
Binary files /dev/null and b/docs/zh_CN/_static/image/tools/analysis/analyze_log.jpg differ
diff --git a/docs/zh_CN/_static/js/custom.js b/docs/zh_CN/_static/js/custom.js
new file mode 100644
index 0000000000000000000000000000000000000000..96f0679385f616f29f7d7106f0507a5f120019be
--- /dev/null
+++ b/docs/zh_CN/_static/js/custom.js
@@ -0,0 +1,20 @@
+var collapsedSections = ['进阶教程', '模型库', '可视化', '分析工具', '部署', '其他说明'];
+
+$(document).ready(function () {
+ $('.model-summary').DataTable({
+ "stateSave": false,
+ "lengthChange": false,
+ "pageLength": 20,
+ "order": [],
+ "language": {
+ "info": "显示 _START_ 至 _END_ 条目(总计 _TOTAL_ )",
+ "infoFiltered": "(筛选自 _MAX_ 条目)",
+ "search": "搜索:",
+ "zeroRecords": "没有找到任何条目",
+ "paginate": {
+ "next": "下一页",
+ "previous": "上一页"
+ },
+ }
+ });
+});
diff --git a/docs/zh_CN/_templates/404.html b/docs/zh_CN/_templates/404.html
new file mode 100644
index 0000000000000000000000000000000000000000..abf3356cf4413269b82439f28b6884fc8e51376f
--- /dev/null
+++ b/docs/zh_CN/_templates/404.html
@@ -0,0 +1,16 @@
+{% extends "layout.html" %}
+
+{% block body %}
+
+<h1>未找到页面</h1>
+
+<p>
+  未找到你要打开的页面。
+</p>
+
+<p>
+  如果你是从旧版本文档跳转至此,可能是对应的页面被移动了。请从左侧的目录中寻找新版本文档,或者跳转至首页。
+</p>
+
+<p>
+  如果你找不到希望打开的文档,欢迎在 Issue 中告诉我们!
+</p>
+
+{% endblock %}
diff --git a/docs/zh_CN/_templates/autosummary/class.rst b/docs/zh_CN/_templates/autosummary/class.rst
new file mode 100644
index 0000000000000000000000000000000000000000..4c3a7a9abf5c5b14ac3ef3b00a2f070480295358
--- /dev/null
+++ b/docs/zh_CN/_templates/autosummary/class.rst
@@ -0,0 +1,13 @@
+.. role:: hidden
+ :class: hidden-section
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+ :members:
+
+..
+ autogenerated from _templates/autosummary/class.rst
+ note it does not have :inherited-members:
diff --git a/docs/zh_CN/_templates/callable.rst b/docs/zh_CN/_templates/callable.rst
new file mode 100644
index 0000000000000000000000000000000000000000..3a7b9d2b96c76dfa3eb1d8bef56f58f219fe7760
--- /dev/null
+++ b/docs/zh_CN/_templates/callable.rst
@@ -0,0 +1,14 @@
+.. role:: hidden
+ :class: hidden-section
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+ :members:
+ :special-members: __call__
+
+..
+ autogenerated from _templates/callable.rst
+ note it does not have :inherited-members:
diff --git a/docs/zh_CN/_templates/data_transform.rst b/docs/zh_CN/_templates/data_transform.rst
new file mode 100644
index 0000000000000000000000000000000000000000..376bfe9db6c305e681f265dd0e20b7b7ea6e500f
--- /dev/null
+++ b/docs/zh_CN/_templates/data_transform.rst
@@ -0,0 +1,13 @@
+.. role:: hidden
+ :class: hidden-section
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+ :members: transform
+
+..
+ autogenerated from _templates/callable.rst
+ note it does not have :inherited-members:
diff --git a/docs/zh_CN/advanced_guides/convention.md b/docs/zh_CN/advanced_guides/convention.md
new file mode 100644
index 0000000000000000000000000000000000000000..941236b698bf0861d1547227c7671c76a59e3075
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/convention.md
@@ -0,0 +1,114 @@
+# MMPretrain 中的约定
+
+## 模型命名规则
+
+MMPretrain 按照以下风格进行模型命名,代码库的贡献者需要遵循相同的命名规则。模型名总体分为五个部分:算法信息,模块信息,预训练信息,训练信息和数据信息。逻辑上属于不同部分的单词之间用下划线 `'_'` 连接,同一部分有多个单词用短横线 `'-'` 连接。
+
+```text
+{algorithm info}_{module info}_{pretrain info}_{training info}_{data info}
+```
+
+- `algorithm info`(可选):算法信息,表示用以训练该模型的主要算法,如 MAE、BEiT 等
+- `module info`:模块信息,主要包含模型的主干网络名称,如 resnet、vit 等
+- `pretrain info`(可选):预训练信息,比如预训练模型是在 ImageNet-21k 数据集上训练的等
+- `training info`:训练信息,训练策略设置,包括 batch size,schedule 以及数据增强等;
+- `data info`:数据信息,数据集名称、模态、输入尺寸等,如 imagenet, cifar 等;
+
+### 算法信息
+
+指用以训练该模型的算法名称,例如:
+
+- `simclr`
+- `mocov2`
+- `eva-mae-style`
+
+使用监督图像分类任务训练的模型可以省略这个字段。
+
+### 模块信息
+
+指模型的结构信息,一般主要包含模型的主干网络结构,`neck` 和 `head` 信息一般被省略。例如:
+
+- `resnet50`
+- `vit-base-p16`
+- `swin-base`
+
+### 预训练信息
+
+如果该模型是在预训练模型基础上,通过微调获得的,我们需要记录预训练模型的一些信息。例如:
+
+- 预训练模型的来源:`fb`、`openai`等。
+- 训练预训练模型的方法:`clip`、`mae`、`distill` 等。
+- 用于预训练的数据集:`in21k`、`laion2b`等(`in1k`可以省略)
+- 训练时长:`300e`、`1600e` 等。
+
+并非所有信息都是必要的,只需要选择用以区分不同的预训练模型的信息即可。
+
+在此字段的末尾,使用 `-pre` 作为标识符,例如 `mae-in21k-pre`。
+
+### 训练信息
+
+训练策略的一些设置,包括训练类型、 `batch size`、 `lr schedule`、 数据增强以及特殊的损失函数等等,比如:
+Batch size 信息:
+
+- 格式为`{gpu x batch_per_gpu}`, 如 `8xb32`
+
+训练类型(主要见于 transformer 网络,如 `ViT` 算法,这类算法通常分为预训练和微调两种模式):
+
+- `ft` : Finetune config,用于微调的配置文件
+- `pt` : Pretrain config,用于预训练的配置文件
+
+训练策略信息,训练策略以复现配置文件为基础,此基础不必标注训练策略。但如果在此基础上进行改进,则需注明训练策略,按照应用点位顺序排列,如:`{pipeline aug}-{train aug}-{loss trick}-{scheduler}-{epochs}`
+
+- `coslr-200e` : 使用 cosine scheduler, 训练 200 个 epoch
+- `autoaug-mixup-lbs-coslr-50e` : 使用了 `autoaug`、`mixup`、`label smooth`、`cosine scheduler`, 训练了 50 个轮次
+
+如果模型是从官方仓库等第三方仓库转换过来的,训练信息可以省略,使用 `3rdparty` 作为标识符。
+
+### 数据信息
+
+- `in1k` : `ImageNet1k` 数据集,默认使用 `224x224` 大小的图片
+- `in21k` : `ImageNet21k` 数据集,有些地方也称为 `ImageNet22k` 数据集,默认使用 `224x224` 大小的图片
+- `in1k-384px` : 表示训练的输入图片大小为 `384x384`
+- `cifar100`
+
+### 模型命名案例
+
+```text
+vit-base-p32_clip-openai-pre_3rdparty_in1k
+```
+
+- `vit-base-p32`: 模块信息
+- `clip-openai-pre`:预训练信息
+ - `clip`:预训练方法是 clip
+ - `openai`:预训练模型来自 OpenAI
+ - `pre`:预训练标识符
+- `3rdparty`:模型是从第三方仓库转换而来的
+- `in1k`:数据集信息。该模型是从 ImageNet-1k 数据集训练而来的,输入大小为 `224x224`
+
+```text
+beit_beit-base-p16_8xb256-amp-coslr-300e_in1k
+```
+
+- `beit`: 算法信息
+- `beit-base`:模块信息,由于主干网络来自 BEiT 中提出的修改版 ViT,主干网络名称也是 `beit`
+- `8xb256-amp-coslr-300e`:训练信息
+ - `8xb256`:使用 8 个 GPU,每个 GPU 的批量大小为 256
+ - `amp`:使用自动混合精度训练
+ - `coslr`:使用余弦退火学习率调度器
+ - `300e`:训练 300 个 epoch
+- `in1k`:数据集信息。该模型是从 ImageNet-1k 数据集训练而来的,输入大小为 `224x224`
+
+## 配置文件命名规则
+
+配置文件的命名与模型名称几乎相同,有几点不同:
+
+- 训练信息是必要的,不能是 `3rdparty`
+- 如果配置文件只包含主干网络设置,既没有头部设置也没有数据集设置,我们将其命名为`{module info}_headless.py`。这种配置文件通常用于大型数据集上的第三方预训练模型。
+
+### 权重命名规则
+
+权重的命名主要包括模型名称,日期和哈希值。
+
+```text
+{model_name}_{date}-{hash}.pth
+```
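+
+例如,一个符合上述规则的权重文件名可能形如下面这样(其中的日期与哈希值仅为示意):
+
+```text
+resnet50_8xb32_in1k_20230101-0123abcd.pth
+```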
diff --git a/docs/zh_CN/advanced_guides/datasets.md b/docs/zh_CN/advanced_guides/datasets.md
new file mode 100644
index 0000000000000000000000000000000000000000..83b7959b9f136e0938c89fe6f171f33c2eedde35
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/datasets.md
@@ -0,0 +1,73 @@
+# 添加新数据集
+
+用户可以编写一个继承自 [BaseDataset](https://mmpretrain.readthedocs.io/zh_CN/latest/_modules/mmpretrain/datasets/base_dataset.html#BaseDataset) 的新数据集类,并重载 `load_data_list(self)` 方法,类似 [CIFAR10](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/cifar.py) 和 [ImageNet](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/imagenet.py)。
+
+通常,此方法返回一个包含所有样本的列表,其中的每个样本都是一个字典。字典中包含了必要的数据信息,例如 `img` 和 `gt_label`。
+
+假设我们将要实现一个 `Filelist` 数据集,该数据集将使用文件列表进行训练和测试。注释列表的格式如下:
+
+```text
+000001.jpg 0
+000002.jpg 1
+...
+```
+
+## 1. 创建数据集类
+
+我们可以在 `mmpretrain/datasets/filelist.py` 中创建一个新的数据集类以加载数据。
+
+```python
+import os.path as osp
+
+from mmpretrain.registry import DATASETS
+from .base_dataset import BaseDataset
+
+
+@DATASETS.register_module()
+class Filelist(BaseDataset):
+
+ def load_data_list(self):
+ assert isinstance(self.ann_file, str)
+
+ data_list = []
+ with open(self.ann_file) as f:
+ samples = [x.strip().split(' ') for x in f.readlines()]
+ for filename, gt_label in samples:
+                # 拼接数据前缀与文件名得到完整路径
+                # (这里假设 `self.img_prefix` 指向图片所在目录)
+                img_path = osp.join(self.img_prefix, filename)
+ info = {'img_path': img_path, 'gt_label': int(gt_label)}
+ data_list.append(info)
+ return data_list
+```
+
+## 2. 添加到库
+
+将新的数据集类加入到 `mmpretrain/datasets/__init__.py` 中:
+
+```python
+from .base_dataset import BaseDataset
+...
+from .filelist import Filelist
+
+__all__ = [
+ 'BaseDataset', ... ,'Filelist'
+]
+```
+
+## 3. 修改相关配置文件
+
+然后在配置文件中,为了使用 `Filelist`,用户可以按以下方式修改配置
+
+```python
+train_dataloader = dict(
+ ...
+ dataset=dict(
+ type='Filelist',
+ ann_file='image_list.txt',
+ pipeline=train_pipeline,
+ )
+)
+```
+
+所有继承 [`BaseDataset`](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/base_dataset.py) 的数据集类都具有**懒加载**以及**节省内存**的特性,可以参考相关文档 {external+mmengine:doc}`BaseDataset `。
+
+```{note}
+如果获取数据样本时的字典中只包含了 'img_path' 而不包含 'img',则 pipeline 中必须包含 'LoadImageFromFile'。
+```
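+
+下面给出一个包含 `LoadImageFromFile` 的最简数据流水线示例,仅作示意,实际使用时可按需添加其他数据变换:
+
+```python
+train_pipeline = [
+    dict(type='LoadImageFromFile'),             # 根据 'img_path' 读取图像到 'img' 字段
+    dict(type='RandomResizedCrop', scale=224),  # 随机缩放裁剪
+    dict(type='PackInputs'),                    # 打包为模型输入
+]
+```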
diff --git a/docs/zh_CN/advanced_guides/evaluation.md b/docs/zh_CN/advanced_guides/evaluation.md
new file mode 100644
index 0000000000000000000000000000000000000000..32db19750458a1b297a8d444df00df76699bd5ef
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/evaluation.md
@@ -0,0 +1,97 @@
+# 自定义评估指标
+
+## 使用 MMPretrain 中的指标
+
+在 MMPretrain 中,我们为单标签分类和多标签分类提供了多种指标:
+
+**单标签分类**:
+
+- [`Accuracy`](mmpretrain.evaluation.Accuracy)
+- [`SingleLabelMetric`](mmpretrain.evaluation.SingleLabelMetric),包括精度、召回率、f1-score 和支持度。
+
+**多标签分类**:
+
+- [`AveragePrecision`](mmpretrain.evaluation.AveragePrecision), 或 AP (mAP)。
+- [`MultiLabelMetric`](mmpretrain.evaluation.MultiLabelMetric),包括精度、召回率、f1-score 和支持度。
+
+要在验证和测试期间使用这些指标,我们需要修改配置文件中的 `val_evaluator` 和 `test_evaluator` 字段。
+
+以下为几个例子:
+
+1. 在验证和测试期间计算 top-1 和 top-5 准确率。
+
+ ```python
+ val_evaluator = dict(type='Accuracy', topk=(1, 5))
+ test_evaluator = val_evaluator
+ ```
+
+2. 在验证和测试期间计算 top-1 准确率、top-5 准确度、精确度和召回率。
+
+ ```python
+ val_evaluator = [
+ dict(type='Accuracy', topk=(1, 5)),
+ dict(type='SingleLabelMetric', items=['precision', 'recall']),
+ ]
+ test_evaluator = val_evaluator
+ ```
+
+3. 计算 mAP(平均精度均值)、CP(类别平均精度)、CR(类别平均召回率)、CF(类别平均 F1 分数)、OP(总体平均精度)、OR(总体平均召回率)和 OF1(总体平均 F1 分数)。
+
+ ```python
+ val_evaluator = [
+ dict(type='AveragePrecision'),
+ dict(type='MultiLabelMetric', average='macro'), # class-wise mean
+ dict(type='MultiLabelMetric', average='micro'), # overall mean
+ ]
+ test_evaluator = val_evaluator
+ ```
+
+## 添加新的指标
+
+MMPretrain 支持为追求更高定制化的用户实现定制化的评估指标。
+
+您需要在 `mmpretrain/evaluation/metrics` 下创建一个新文件,并在该文件中实现新的指标,例如,在 `mmpretrain/evaluation/metrics/my_metric.py` 中。并创建一个自定义的评估指标类 `MyMetric` 继承 [MMEngine 中的 BaseMetric](mmengine.evaluator.BaseMetric)。
+
+需要分别重写数据格式处理方法 `process` 和指标计算方法 `compute_metrics`,并将该类注册到 `METRICS` 注册器中,以实现自定义评估指标。
+
+```python
+from typing import Dict, List, Sequence
+
+from mmengine.evaluator import BaseMetric
+from mmpretrain.registry import METRICS
+
+@METRICS.register_module()
+class MyMetric(BaseMetric):
+
+ def process(self, data_batch: Sequence[Dict], data_samples: Sequence[Dict]):
+ """ The processed results should be stored in ``self.results``, which will
+ be used to computed the metrics when all batches have been processed.
+ `data_batch` stores the batch data from dataloader,
+ and `data_samples` stores the batch outputs from model.
+ """
+ ...
+
+ def compute_metrics(self, results: List):
+ """ Compute the metrics from processed results and returns the evaluation results.
+ """
+ ...
+```
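+
+作为参考,下面是一个统计 top-1 正确率的简化示意实现。这里假设 `data_samples` 中的每个字典都包含 `pred_label` 和 `gt_label` 字段,类名 `SimpleAccuracy` 也仅为示例,实际字段名请以你使用的数据结构为准:
+
+```python
+from mmengine.evaluator import BaseMetric
+from mmpretrain.registry import METRICS
+
+
+@METRICS.register_module()
+class SimpleAccuracy(BaseMetric):
+
+    def process(self, data_batch, data_samples):
+        for data_sample in data_samples:
+            # 假设的字段名,仅作示意
+            pred = data_sample['pred_label']
+            gt = data_sample['gt_label']
+            # 将每个样本的判定结果暂存到 self.results 中
+            self.results.append({'correct': int(pred == gt)})
+
+    def compute_metrics(self, results):
+        # 汇总所有批次的结果,计算 top-1 正确率
+        correct = sum(r['correct'] for r in results)
+        return {'accuracy': 100.0 * correct / len(results)}
+```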
+
+然后,将其导入 `mmpretrain/evaluation/metrics/__init__.py` 以将其添加到 `mmpretrain.evaluation` 包中。
+
+```python
+# In mmpretrain/evaluation/metrics/__init__.py
+...
+from .my_metric import MyMetric
+
+__all__ = [..., 'MyMetric']
+```
+
+最后,在配置文件的 `val_evaluator` 和 `test_evaluator` 字段中使用 `MyMetric`。
+
+```python
+val_evaluator = dict(type='MyMetric', ...)
+test_evaluator = val_evaluator
+```
+
+```{note}
+更多的细节可以参考 {external+mmengine:doc}`MMEngine 文档: Evaluation `.
+```
diff --git a/docs/zh_CN/advanced_guides/modules.md b/docs/zh_CN/advanced_guides/modules.md
new file mode 100644
index 0000000000000000000000000000000000000000..cb0fac6a11a79a736a7e7290e8e107745bb98d57
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/modules.md
@@ -0,0 +1,512 @@
+# 自定义模型
+
+在我们的设计中,一个完整的模型被定义为顶层模块,它根据功能的不同,由几种不同类型的模型组件组成。
+
+- 模型:顶层模块定义了具体的任务类型,例如 `ImageClassifier` 用在图像分类任务中, `MAE` 用在自监督学习中, `ImageToImageRetriever` 用在图像检索中。
+- 主干网络:通常是一个特征提取网络,涵盖了模型之间绝大多数的差异,例如 `ResNet`、`MobileNet`。
+- 颈部:用于连接主干网络和头部的组件,例如 `GlobalAveragePooling`。
+- 头部:用于执行特定任务的组件,例如 `ClsHead`、 `ContrastiveHead`。
+- 损失函数:在头部用于计算损失函数的组件,例如 `CrossEntropyLoss`、`LabelSmoothLoss`。
+- 目标生成器: 用于自监督学习任务的组件,例如 `VQKD`、 `HOGGenerator`。
+
+## 添加新的顶层模型
+
+通常来说,图像分类和图像检索任务的顶层模型计算流程基本一致。但不同的自监督学习算法会使用不同的计算流程,像 `MAE` 和 `BEiT` 就大不相同。所以在这个部分,我们将简单介绍如何添加一个新的自监督学习算法。
+
+### 添加新的自监督学习算法
+
+1. 创建新文件 `mmpretrain/models/selfsup/new_algorithm.py` 以及实现 `NewAlgorithm`
+
+ ```python
+ from mmpretrain.registry import MODELS
+ from .base import BaseSelfSupvisor
+
+
+ @MODELS.register_module()
+ class NewAlgorithm(BaseSelfSupvisor):
+
+ def __init__(self, backbone, neck=None, head=None, init_cfg=None):
+ super().__init__(init_cfg)
+ pass
+
+ # ``extract_feat`` function is defined in BaseSelfSupvisor, you could
+ # overwrite it if needed
+ def extract_feat(self, inputs, **kwargs):
+ pass
+
+ # the core function to compute the loss
+ def loss(self, inputs, data_samples, **kwargs):
+ pass
+
+ ```
+
+2. 在 `mmpretrain/models/selfsup/__init__.py` 中导入对应的新算法
+
+ ```python
+ ...
+ from .new_algorithm import NewAlgorithm
+
+ __all__ = [
+ ...,
+ 'NewAlgorithm',
+ ...
+ ]
+ ```
+
+3. 在配置文件中使用新算法
+
+ ```python
+ model = dict(
+ type='NewAlgorithm',
+ backbone=...,
+ neck=...,
+ head=...,
+ ...
+ )
+ ```
+
+## 添加新的主干网络
+
+这里,我们以 `ResNet_CIFAR` 为例,展示了如何开发一个新的主干网络组件。
+
+`ResNet_CIFAR` 针对 CIFAR 32x32 的图像输入,远小于大多数模型使用的ImageNet默认的224x224输入配置,所以我们将骨干网络中 `kernel_size=7,stride=2`
+的设置替换为 `kernel_size=3, stride=1`,并移除了 stem 层之后的
+`MaxPooling`,以避免传递过小的特征图到残差块中。
+
+最简单的方式就是继承自 `ResNet` 并只修改 stem 层。
+
+1. 创建一个新文件 `mmpretrain/models/backbones/resnet_cifar.py`。
+
+ ```python
+    import torch.nn as nn
+    from mmcv.cnn import build_conv_layer, build_norm_layer
+
+    from mmpretrain.registry import MODELS
+    from .resnet import ResNet
+
+
+ @MODELS.register_module()
+ class ResNet_CIFAR(ResNet):
+
+ """ResNet backbone for CIFAR.
+
+ (对这个主干网络的简短描述)
+
+ Args:
+ depth(int): Network depth, from {18, 34, 50, 101, 152}.
+ ...
+ (参数文档)
+ """
+
+ def __init__(self, depth, deep_stem=False, **kwargs):
+ # 调用基类 ResNet 的初始化函数
+            super(ResNet_CIFAR, self).__init__(depth, deep_stem=deep_stem, **kwargs)
+ # 其他特殊的初始化流程
+ assert not self.deep_stem, 'ResNet_CIFAR do not support deep_stem'
+
+ def _make_stem_layer(self, in_channels, base_channels):
+ # 重载基类的方法,以实现对网络结构的修改
+ self.conv1 = build_conv_layer(
+ self.conv_cfg,
+ in_channels,
+ base_channels,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ bias=False)
+ self.norm1_name, norm1 = build_norm_layer(
+ self.norm_cfg, base_channels, postfix=1)
+ self.add_module(self.norm1_name, norm1)
+ self.relu = nn.ReLU(inplace=True)
+
+ def forward(self, x):
+ # 如果需要的话,可以自定义forward方法
+ x = self.conv1(x)
+ x = self.norm1(x)
+ x = self.relu(x)
+ outs = []
+ for i, layer_name in enumerate(self.res_layers):
+ res_layer = getattr(self, layer_name)
+ x = res_layer(x)
+ if i in self.out_indices:
+ outs.append(x)
+ # 输出值需要是一个包含不同层多尺度输出的元组
+ # 如果不需要多尺度特征,可以直接在最终输出上包一层元组
+ return tuple(outs)
+
+ def init_weights(self):
+ # 如果需要的话,可以自定义权重初始化的方法
+ super().init_weights()
+
+ # 如果有预训练模型,则不需要进行权重初始化
+ if self.init_cfg is not None and self.init_cfg['type'] == 'Pretrained':
+ return
+
+ # 通常来说,我们建议用`init_cfg`去列举不同层权重初始化方法
+ # 包括卷积层,线性层,归一化层等等
+ # 如果有特殊需要,可以在这里进行额外的初始化操作
+ ...
+ ```
+
+```{note}
+在 OpenMMLab 2.0 的设计中,将原有的`BACKBONES`、`NECKS`、`HEADS`、`LOSSES`等注册名统一为`MODELS`.
+```
+
+2. 在 `mmpretrain/models/backbones/__init__.py` 中导入新模块
+
+ ```python
+ ...
+ from .resnet_cifar import ResNet_CIFAR
+
+ __all__ = [
+ ..., 'ResNet_CIFAR'
+ ]
+ ```
+
+3. 在配置文件中使用新的主干网络
+
+ ```python
+ model = dict(
+ ...
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=18,
+ other_arg=xxx),
+ ...
+ ```
+
+### 为自监督学习添加新的主干网络
+
+对于一部分自监督学习算法,主干网络做了一定修改,例如 `MAE`、`BEiT` 等。 这些主干网络需要处理 `mask` 相关的逻辑,以此从可见的图像块中提取对应的特征信息。
+
+以 [MAEViT](mmpretrain.models.selfsup.MAEViT) 作为例子,我们需要重写 `forward` 函数,进行基于 `mask` 的计算。我们实现了 `init_weights` 进行特定权重的初始化和 `random_masking` 函数来生成 `MAE` 预训练所需要的 `mask`。
+
+```python
+from typing import Optional, Tuple
+
+import torch
+
+# VisionTransformer 是 MMPretrain 中的 ViT 主干网络
+from mmpretrain.models import VisionTransformer
+
+
+class MAEViT(VisionTransformer):
+ """Vision Transformer for MAE pre-training"""
+
+    def __init__(self, mask_ratio, **kwargs) -> None:
+ super().__init__(**kwargs)
+ # position embedding is not learnable during pretraining
+ self.pos_embed.requires_grad = False
+ self.mask_ratio = mask_ratio
+ self.num_patches = self.patch_resolution[0] * self.patch_resolution[1]
+
+ def init_weights(self) -> None:
+ """Initialize position embedding, patch embedding and cls token."""
+ super().init_weights()
+ # define what if needed
+ pass
+
+ def random_masking(
+ self,
+ x: torch.Tensor,
+ mask_ratio: float = 0.75
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+ """Generate the mask for MAE Pre-training."""
+ pass
+
+ def forward(
+ self,
+ x: torch.Tensor,
+ mask: Optional[bool] = True
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+ """Generate features for masked images.
+
+ The function supports two kind of forward behaviors. If the ``mask`` is
+ ``True``, the function will generate mask to masking some patches
+ randomly and get the hidden features for visible patches, which means
+        the function will be executed as masked image modeling pre-training;
+ if the ``mask`` is ``None`` or ``False``, the forward function will
+ call ``super().forward()``, which extract features from images without
+ mask.
+ """
+        if mask is None or mask is False:
+ return super().forward(x)
+
+ else:
+ B = x.shape[0]
+ x = self.patch_embed(x)[0]
+ # add pos embed w/o cls token
+ x = x + self.pos_embed[:, 1:, :]
+
+ # masking: length -> length * mask_ratio
+ x, mask, ids_restore = self.random_masking(x, self.mask_ratio)
+
+ # append cls token
+ cls_token = self.cls_token + self.pos_embed[:, :1, :]
+ cls_tokens = cls_token.expand(B, -1, -1)
+ x = torch.cat((cls_tokens, x), dim=1)
+
+ for _, layer in enumerate(self.layers):
+ x = layer(x)
+ # Use final norm
+ x = self.norm1(x)
+
+ return (x, mask, ids_restore)
+
+```
+
+## 添加新的颈部组件
+
+这里我们以 `GlobalAveragePooling` 为例。这是一个非常简单的颈部组件,没有任何参数。
+
+要添加新的颈部组件,我们主要需要实现 `forward` 函数,该函数对主干网络的输出进行
+一些操作并将结果传递到头部。
+
+1. 创建一个新文件 `mmpretrain/models/necks/gap.py`
+
+ ```python
+ import torch.nn as nn
+
+ from mmpretrain.registry import MODELS
+
+ @MODELS.register_module()
+ class GlobalAveragePooling(nn.Module):
+
+        def __init__(self):
+            super().__init__()
+            self.gap = nn.AdaptiveAvgPool2d((1, 1))
+
+ def forward(self, inputs):
+ # 简单起见,我们默认输入是一个张量
+ outs = self.gap(inputs)
+ outs = outs.view(inputs.size(0), -1)
+ return outs
+ ```
+
+2. 在 `mmpretrain/models/necks/__init__.py` 中导入新模块
+
+ ```python
+ ...
+ from .gap import GlobalAveragePooling
+
+ __all__ = [
+ ..., 'GlobalAveragePooling'
+ ]
+ ```
+
+3. 修改配置文件以使用新的颈部组件
+
+ ```python
+ model = dict(
+ neck=dict(type='GlobalAveragePooling'),
+ )
+ ```
+
+## 添加新的头部组件
+
+### 基于分类头
+
+在此,我们以一个简化的 `VisionTransformerClsHead` 为例,说明如何开发新的头部组件。
+
+要添加一个新的头部组件,基本上我们需要实现 `pre_logits` 函数,用于完成进入最后的分类层之前所需的处理,
+以及 `forward` 函数。
+
+1. 创建一个文件 `mmpretrain/models/heads/vit_head.py`.
+
+ ```python
+ import torch.nn as nn
+
+ from mmpretrain.registry import MODELS
+ from .cls_head import ClsHead
+
+
+ @MODELS.register_module()
+    class VisionTransformerClsHead(ClsHead):
+
+ def __init__(self, num_classes, in_channels, hidden_dim, **kwargs):
+ super().__init__(**kwargs)
+ self.in_channels = in_channels
+ self.num_classes = num_classes
+ self.hidden_dim = hidden_dim
+
+ self.fc1 = nn.Linear(in_channels, hidden_dim)
+ self.act = nn.Tanh()
+ self.fc2 = nn.Linear(hidden_dim, num_classes)
+
+ def pre_logits(self, feats):
+ # 骨干网络的输出通常包含多尺度信息的元组
+ # 对于分类任务来说,我们只需要关注最后的输出
+ feat = feats[-1]
+
+ # VisionTransformer的最终输出是一个包含patch tokens和cls tokens的元组
+ # 这里我们只需要cls tokens
+ _, cls_token = feat
+
+ # 完成除了最后的线性分类头以外的操作
+ return self.act(self.fc1(cls_token))
+
+ def forward(self, feats):
+ pre_logits = self.pre_logits(feats)
+
+ # 完成最后的分类头
+            cls_score = self.fc2(pre_logits)
+ return cls_score
+ ```
+
+2. 在 `mmpretrain/models/heads/__init__.py` 中导入这个模块
+
+ ```python
+ ...
+ from .vit_head import VisionTransformerClsHead
+
+ __all__ = [
+ ..., 'VisionTransformerClsHead'
+ ]
+ ```
+
+3. 修改配置文件以使用新的头部组件。
+
+ ```python
+ model = dict(
+ head=dict(
+ type='VisionTransformerClsHead',
+ ...,
+ ))
+ ```
+
+### 基于 BaseModule 类
+
+这是一个基于 MMEngine 中的 `BaseModule` 进行开发的例子 `MAEPretrainHead`,主要用于 `MAE` 的掩码学习。我们需要实现 `loss` 函数来计算损失,其它的函数均为可选项。
+
+```python
+# Copyright (c) OpenMMLab. All rights reserved.
+import torch
+from mmengine.model import BaseModule
+
+from mmpretrain.registry import MODELS
+
+
+@MODELS.register_module()
+class MAEPretrainHead(BaseModule):
+ """Head for MAE Pre-training."""
+
+ def __init__(self,
+ loss: dict,
+ norm_pix: bool = False,
+ patch_size: int = 16) -> None:
+ super().__init__()
+ self.norm_pix = norm_pix
+ self.patch_size = patch_size
+ self.loss_module = MODELS.build(loss)
+
+ def patchify(self, imgs: torch.Tensor) -> torch.Tensor:
+ """Split images into non-overlapped patches."""
+ p = self.patch_size
+ assert imgs.shape[2] == imgs.shape[3] and imgs.shape[2] % p == 0
+
+ h = w = imgs.shape[2] // p
+ x = imgs.reshape(shape=(imgs.shape[0], 3, h, p, w, p))
+ x = torch.einsum('nchpwq->nhwpqc', x)
+ x = x.reshape(shape=(imgs.shape[0], h * w, p**2 * 3))
+ return x
+
+ def construct_target(self, target: torch.Tensor) -> torch.Tensor:
+ """Construct the reconstruction target."""
+ target = self.patchify(target)
+ if self.norm_pix:
+ # normalize the target image
+ mean = target.mean(dim=-1, keepdim=True)
+ var = target.var(dim=-1, keepdim=True)
+ target = (target - mean) / (var + 1.e-6)**.5
+
+ return target
+
+ def loss(self, pred: torch.Tensor, target: torch.Tensor,
+ mask: torch.Tensor) -> torch.Tensor:
+ """Generate loss."""
+ target = self.construct_target(target)
+ loss = self.loss_module(pred, target, mask)
+
+ return loss
+```
+
+完成实现后,之后的步骤和 [基于分类头](#基于分类头) 中的步骤 2 和步骤 3 一致。
+
+## 添加新的损失函数
+
+要添加新的损失函数,我们主要需要在损失函数模块中实现 `forward` 函数。这里需要注意的是,损失模块也应该注册到 `MODELS` 中。另外,利用装饰器 `weighted_loss` 可以方便地实现对每个元素的损失进行加权平均。
+
+假设我们要模拟从另一个分类模型生成的概率分布,需要添加 `L1loss` 来实现该目的。
+
+1. 创建一个新文件 `mmpretrain/models/losses/l1_loss.py`
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ from mmpretrain.registry import MODELS
+ from .utils import weighted_loss
+
+ @weighted_loss
+ def l1_loss(pred, target):
+ assert pred.size() == target.size() and target.numel() > 0
+ loss = torch.abs(pred - target)
+ return loss
+
+ @MODELS.register_module()
+ class L1Loss(nn.Module):
+
+ def __init__(self, reduction='mean', loss_weight=1.0):
+ super(L1Loss, self).__init__()
+ self.reduction = reduction
+ self.loss_weight = loss_weight
+
+ def forward(self,
+ pred,
+ target,
+ weight=None,
+ avg_factor=None,
+ reduction_override=None):
+ assert reduction_override in (None, 'none', 'mean', 'sum')
+ reduction = (
+ reduction_override if reduction_override else self.reduction)
+ loss = self.loss_weight * l1_loss(
+ pred, target, weight, reduction=reduction, avg_factor=avg_factor)
+ return loss
+ ```
+
+2. 在文件 `mmpretrain/models/losses/__init__.py` 中导入这个模块
+
+ ```python
+ ...
+ from .l1_loss import L1Loss
+
+ __all__ = [
+ ..., 'L1Loss'
+ ]
+ ```
+
+3. 修改配置文件中的 `loss` 字段以使用新的损失函数
+
+ ```python
+ model = dict(
+ head=dict(
+ loss=dict(type='L1Loss', loss_weight=1.0),
+ ))
+ ```
+
+最后我们可以在配置文件中结合所有新增的模型组件来使用新的模型。由于 `ResNet_CIFAR` 不是一个基于 ViT 的骨干网络,这里我们不使用 `VisionTransformerClsHead` 的配置。
+
+```python
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=18,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=512,
+ loss=dict(type='L1Loss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
+
+```
+
+```{tip}
+为了方便,相同的模型组件可以直接从已有的config文件里继承,更多细节可以参考[学习配置文件](../user_guides/config.md)。
+```
diff --git a/docs/zh_CN/advanced_guides/pipeline.md b/docs/zh_CN/advanced_guides/pipeline.md
new file mode 100644
index 0000000000000000000000000000000000000000..99506b0848008befab6781771071cbb54cf2bfb0
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/pipeline.md
@@ -0,0 +1,148 @@
+# 自定义数据处理流程
+
+## 数据流的设计
+
+在[新数据集教程](./datasets.md)中,我们知道数据集类使用 `load_data_list` 方法来初始化整个数据集,我们将每个样本的信息保存到一个 dict 中。
+
+通常,为了节省内存,我们在 `load_data_list` 中只加载图片路径和标签,在实际使用时才加载完整的图片内容。此外,我们可能希望在训练中读取样本时进行一些随机数据增强。
+
+数据管道意味着在从数据集中索引样本时如何处理样本字典,它由一系列数据变换组成。每个数据变换都将一个字典作为输入,对其进行处理,并为下一个数据变换输出一个字典。
+
+这是 ImageNet 上 ResNet-50 训练的数据管道示例。
+
+```python
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+```
+
+MMPretrain 中所有可用的数据变换都可以在 [数据变换文档](mmpretrain.datasets.transforms) 中找到。
+
+## 修改训练/测试管道
+
+MMPretrain 中的数据管道非常灵活。您几乎可以从配置文件中控制数据预处理的每一步,但另一方面,面对如此多的选项,您可能会感到困惑。
+
+这是图像分类任务的常见做法和指南。
+
+### 读取
+
+在数据管道的开始,我们通常需要从文件路径加载图像数据。
+[`LoadImageFromFile`](mmcv.transforms.LoadImageFromFile) 通常用于执行此任务。
+
+```python
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ ...
+]
+```
+
+如果您想从具有特殊格式或特殊位置的文件中加载数据,您可以[实现新的加载变换](#添加新的数据变换)并将其添加到数据管道的开头。
+
+### 增强和其它处理
+
+在训练过程中,我们通常需要做数据增强来避免过拟合。在测试过程中,我们还需要做一些数据处理,比如调整大小和裁剪。这些数据变换将放置在加载过程之后。
+
+这是一个简单的数据扩充方案示例。它会将输入图像随机调整大小并裁剪到指定比例,并随机水平翻转图像。
+
+```python
+train_pipeline = [
+ ...
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ ...
+]
+```
+
+下面是 [Swin-Transformer](../papers/swin_transformer.md) 训练中使用的较为复杂的数据增强配置示例。为了与官方实现保持一致,它指定 `pillow` 作为图像缩放后端,`bicubic` 作为插值算法。此外,它添加了 [`RandAugment`](mmpretrain.datasets.transforms.RandAugment) 和 [`RandomErasing`](mmpretrain.datasets.transforms.RandomErasing) 作为额外的数据增强方法。
+
+此配置指定了数据增强的每个细节,您只需将其复制到您自己的配置文件中,即可应用 Swin-Transformer 的数据增强策略。
+
+```python
+bgr_mean = [103.53, 116.28, 123.675]
+bgr_std = [57.375, 57.12, 58.395]
+
+train_pipeline = [
+ ...
+ dict(type='RandomResizedCrop', scale=224, backend='pillow', interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ ...
+]
+```
+
+```{note}
+通常,数据管道中的数据增强部分仅处理图像方面的变换,而不处理图像归一化或混合/剪切混合等变换。 因为我们可以对 batch data 做 image normalization 和 mixup/cutmix 来加速。要配置图像归一化和 mixup/cutmix,请使用 [数据预处理器](mmpretrain.models.utils.data_preprocessor)。
+```
+
+### 格式化
+
+格式化是从数据信息字典中收集训练数据,并将这些数据转换为模型友好的格式。
+
+在大多数情况下,您可以简单地使用 [`PackInputs`](mmpretrain.datasets.transforms.PackInputs),它将 NumPy 数组格式的图像转换为 PyTorch 张量,并将 ground truth 类别信息和其他元信息打包为 [`DataSample`](mmpretrain.structures.DataSample)。
+
+```python
+train_pipeline = [
+ ...
+ dict(type='PackInputs'),
+]
+```
+
+## 添加新的数据变换
+
+1. 在任何文件中写入一个新的数据变换,例如 `my_transform.py`,并将其放在文件夹 `mmpretrain/datasets/transforms/` 中。数据变换类需要继承 [`mmcv.transforms.BaseTransform`](mmcv.transforms.BaseTransform) 类,并重写以字典作为输入、以字典作为输出的 `transform` 方法(本小节末尾给出了一个更具体的示意实现)。
+
+ ```python
+ from mmcv.transforms import BaseTransform
+ from mmpretrain.registry import TRANSFORMS
+
+ @TRANSFORMS.register_module()
+ class MyTransform(BaseTransform):
+
+ def transform(self, results):
+ # Modify the data information dict `results`.
+ return results
+ ```
+
+2. 在 `mmpretrain/datasets/transforms/__init__.py` 中导入新的变换
+
+ ```python
+ ...
+ from .my_transform import MyTransform
+
+ __all__ = [
+ ..., 'MyTransform'
+ ]
+ ```
+
+3. 在配置文件中使用
+
+ ```python
+ train_pipeline = [
+ ...
+ dict(type='MyTransform'),
+ ...
+ ]
+ ```
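+
+作为补充,下面给出一个更具体的示意实现:一个给图像添加高斯噪声的数据变换。这里假设 `results['img']` 是由 `LoadImageFromFile` 读取得到的 numpy 数组,变换名称 `AddGaussianNoise` 仅为示例:
+
+```python
+import numpy as np
+from mmcv.transforms import BaseTransform
+
+from mmpretrain.registry import TRANSFORMS
+
+
+@TRANSFORMS.register_module()
+class AddGaussianNoise(BaseTransform):
+    """给图像添加高斯噪声(仅作示意)。"""
+
+    def __init__(self, sigma=0.1):
+        self.sigma = sigma
+
+    def transform(self, results):
+        # 假设 'img' 字段为 uint8 的 numpy 数组
+        img = results['img'].astype(np.float32)
+        noise = np.random.normal(0., self.sigma * 255, img.shape)
+        results['img'] = np.clip(img + noise, 0, 255).astype(np.uint8)
+        return results
+```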
+
+## 数据管道可视化
+
+数据流水线设计完成后,可以使用 [可视化工具](../useful_tools/dataset_visualization.md) 查看效果。
diff --git a/docs/zh_CN/advanced_guides/runtime.md b/docs/zh_CN/advanced_guides/runtime.md
new file mode 100644
index 0000000000000000000000000000000000000000..e5fa3864a47a9ebb77ab992cc45b1162814f52fb
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/runtime.md
@@ -0,0 +1,213 @@
+# 自定义运行参数
+
+运行参数配置包括许多有用的功能,如权重文件保存、日志配置等等,在本教程中,我们将介绍如何配置这些功能。
+
+## 保存权重文件
+
+权重文件保存功能是一个在训练阶段默认注册的钩子, 你可以通过配置文件中的 `default_hooks.checkpoint` 字段配置它。
+
+```{note}
+钩子机制在 OpenMMLab 开源算法库中应用非常广泛。通过钩子,你可以在不修改运行器的主要执行逻辑的情况下插入许多功能。
+
+可以通过{external+mmengine:doc}`相关文章 `进一步理解钩子。
+```
+
+**默认配置:**
+
+```python
+default_hooks = dict(
+ ...
+ checkpoint = dict(type='CheckpointHook', interval=1)
+ ...
+)
+```
+
+下面是一些[权重文件钩子(CheckpointHook)](mmengine.hooks.CheckpointHook)的常用可配置参数。
+
+- **`interval`** (int): 文件保存周期。如果使用-1,它将永远不会保存权重。
+- **`by_epoch`** (bool): 选择 **`interval`** 是基于epoch还是基于iteration, 默认为 `True`.
+- **`out_dir`** (str): 保存权重文件的根目录。如果不指定,检查点将被保存在工作目录中。如果指定,检查点将被保存在 **`out_dir`** 的子文件夹中。
+- **`max_keep_ckpts`** (int): 要保留的权重文件数量。在某些情况下,为了节省磁盘空间,我们希望只保留最近的几个权重文件。默认为 -1,也就是无限制。
+- **`save_best`** (str, List[str]): 如果指定,它将保存具有最佳评估结果的权重。
+ 通常情况下,你可以直接使用`save_best="auto"`来自动选择评估指标。
+
+而如果你想要更高级的配置,请参考[权重文件钩子(CheckpointHook)](tutorials/hook.md#checkpointhook)。
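+
+例如,下面的配置(仅作示意)表示每个 epoch 保存一次权重,只保留最近的 3 个权重文件,并额外保存评估结果最好的权重:
+
+```python
+default_hooks = dict(
+    checkpoint=dict(
+        type='CheckpointHook',
+        interval=1,          # 每个 epoch 保存一次
+        max_keep_ckpts=3,    # 只保留最近的 3 个权重文件
+        save_best='auto',    # 自动根据评估指标保存最优权重
+    ),
+)
+```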
+
+## 权重加载 / 断点训练
+
+在配置文件中,你可以加载指定模型权重或者断点继续训练,如下所示:
+
+```python
+# 从指定权重文件加载
+load_from = "Your checkpoint path"
+
+# 是否从加载的断点继续训练
+resume = False
+```
+
+`load_from` 字段可以是本地路径,也可以是HTTP路径。你可以从检查点恢复训练,方法是指定 `resume=True`。
+
+```{tip}
+你也可以通过指定 `load_from=None` 和 `resume=True` 启用从最新的断点自动恢复。
+Runner执行器将自动从工作目录中找到最新的权重文件。
+```
+
+如果你用我们的 `tools/train.py` 脚本来训练模型,你只需使用 `--resume` 参数来恢复训练,就不用手动修改配置文件了。如下所示:
+
+```bash
+# 自动从最新的断点恢复
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume
+
+# 从指定的断点恢复
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume checkpoints/resnet.pth
+```
+
+## 随机性(Randomness)配置
+
+为了让实验尽可能是可复现的, 我们在 `randomness` 字段中提供了一些控制随机性的选项。
+
+默认情况下,我们不会在配置文件中指定随机数种子,在每次实验中,程序会生成一个不同的随机数种子。
+
+**默认配置:**
+
+```python
+randomness = dict(seed=None, deterministic=False)
+```
+
+为了使实验更具可复现性,你可以指定一个种子并设置 `deterministic=True`。
+`deterministic` 选项的使用效果可以在[这里](https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking)找到。
+
+## 日志配置
+
+日志的配置与多个字段有关。
+
+在`log_level`字段中,你可以指定全局日志级别。参见 {external+python:ref}`Logging Levels` 以获得日志级别列表。
+
+```python
+log_level = 'INFO'
+```
+
+在 `default_hooks.logger` 字段中,你可以指定训练和测试期间的日志间隔。
+而所有可用的参数可以在[日志钩子文档](tutorials/hook.md#loggerhook)中找到。
+
+```python
+default_hooks = dict(
+ ...
+ # 每100次迭代就打印一次日志
+ logger=dict(type='LoggerHook', interval=100),
+ ...
+)
+```
+
+在 `log_processor` 字段中,你可以指定日志信息的平滑方法。
+通常,我们使用一个长度为10的窗口来平滑日志中的值,并输出所有信息的平均值。
+如果你想特别指定某些信息的平滑方法,请参阅{external+mmengine:doc}`日志处理器文档 `。
+
+```python
+# 默认设置,它将通过一个10长度的窗口平滑训练日志中的值
+log_processor = dict(window_size=10)
+```
+
+在 `visualizer` 字段中,你可以指定多个后端来保存日志信息,如TensorBoard和WandB。
+更多的细节可以在[可视化工具](#visualizer)找到。
+
+## 自定义钩子
+
+上述许多功能是由钩子实现的,你也可以通过修改 `custom_hooks` 字段来插入其他的自定义钩子。
+下面是 MMEngine 和 MMPretrain 中的一些钩子,你可以直接使用,例如:
+
+- [EMAHook](mmpretrain.engine.hooks.EMAHook)
+- [SyncBuffersHook](mmengine.hooks.SyncBuffersHook)
+- [EmptyCacheHook](mmengine.hooks.EmptyCacheHook)
+- [ClassNumCheckHook](mmpretrain.engine.hooks.ClassNumCheckHook)
+- ......
+
+例如,EMA(Exponential Moving Average)在模型训练中被广泛使用,你可以通过以下方式启用它:
+
+```python
+custom_hooks = [
+ dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL'),
+]
+```
+
+## 验证可视化
+
+验证可视化钩子是一个验证过程中默认注册的钩子。
+你可以在 `default_hooks.visualization` 字段中来配置它。
+
+默认情况下,我们禁用这个钩子,你可以通过指定 `enable=True` 来启用它。而更多的参数可以在
+[可视化钩子文档](mmpretrain.engine.hooks.VisualizationHook)中找到。
+
+```python
+default_hooks = dict(
+ ...
+ visualization=dict(type='VisualizationHook', enable=False),
+ ...
+)
+```
+
+这个钩子将在验证数据集中选择一部分图像,在每次验证过程中记录并可视化它们的预测结果。
+你可以用它来观察训练期间模型在实际图像上的性能变化。
+
+此外,如果你的验证数据集中的图像很小(\<100,如 CIFAR 数据集),
+你可以指定 `rescale_factor` 来缩放它们,例如 `rescale_factor=2.` 会将可视化的图像放大两倍。
+
+## Visualizer
+
+`Visualizer` 用于记录训练和测试过程中的各种信息,包括日志、图像和标量。
+默认情况下,记录的信息将被保存在工作目录下的 `vis_data` 文件夹中。
+
+**默认配置:**
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ ]
+)
+```
+
+通常,最有用的功能是将日志和标量如 `loss` 保存到不同的后端。
+例如,要把它们保存到 TensorBoard,只需像下面这样设置:
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ dict(type='TensorboardVisBackend'),
+ ]
+)
+```
+
+或者像下面这样把它们保存到 WandB:
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ dict(type='WandbVisBackend'),
+ ]
+)
+```
+
+## 环境配置
+
+在 `env_cfg` 字段中,你可以配置一些底层的参数,如 cuDNN、多进程和分布式通信。
+
+**在修改这些参数之前,请确保你理解这些参数的含义。**
+
+```python
+env_cfg = dict(
+ # 是否启用cudnn基准测试
+ cudnn_benchmark=False,
+
+ # 设置多进程参数
+ mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
+
+ # 设置分布式参数
+ dist_cfg=dict(backend='nccl'),
+)
+```
diff --git a/docs/zh_CN/advanced_guides/schedule.md b/docs/zh_CN/advanced_guides/schedule.md
new file mode 100644
index 0000000000000000000000000000000000000000..d1c347d11930acd7087701ae2db0e750a9012ef2
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/schedule.md
@@ -0,0 +1,359 @@
+# 自定义训练优化策略
+
+在我们的算法库中,已经提供了通用数据集(如ImageNet,CIFAR)的[默认训练策略配置](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/schedules)。如果想要在这些数据集上继续提升模型性能,或者在不同数据集和方法上进行新的尝试,我们通常需要修改这些默认的策略。
+
+在本教程中,我们将介绍在运行自定义训练时,如何通过修改配置文件来构造优化器、进行参数精细化配置、梯度裁剪、梯度累计以及定制动量调整策略等。同时也会通过模板简单介绍如何自定义开发优化器和优化器构造器。
+
+## 配置训练优化策略
+
+我们通过 `optim_wrapper` 来配置主要的优化策略,包括优化器的选择,混合精度训练的选择,参数化精细配置,梯度裁剪以及梯度累计。接下来将分别介绍这些内容。
+
+### 构造 PyTorch 内置优化器
+
+MMPretrain 支持 PyTorch 实现的所有优化器,仅需在配置文件中,指定优化器封装需要的 `optimizer` 字段。
+
+如果要使用 [`SGD`](torch.optim.SGD),则修改如下。这里要注意所有优化相关的配置都需要封装在 `optim_wrapper` 配置里。
+
+```python
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=0.0003, weight_decay=0.0001)
+)
+```
+
+```{note}
+配置文件中的 'type' 不是构造时的参数,而是 PyTorch 内置优化器的类名。
+更多优化器选择可以参考{external+torch:ref}`PyTorch 支持的优化器列表`。
+```
+
+要修改模型的学习率,只需要在优化器的配置中修改 `lr` 即可。
+要配置其他参数,可直接根据 [PyTorch API 文档](torch.optim) 进行。
+
+例如,如果想使用 [`Adam`](torch.optim.Adam) 并设置参数为 `torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)`。
+则需要进行如下修改:
+
+```python
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer = dict(
+ type='Adam',
+ lr=0.001,
+ betas=(0.9, 0.999),
+ eps=1e-08,
+ weight_decay=0,
+ amsgrad=False),
+)
+```
+
+````{note}
+考虑到对于单精度训练来说,优化器封装的默认类型就是 `OptimWrapper`,我们在这里可以直接省略,因此配置文件可以进一步简化为:
+
+```python
+optim_wrapper = dict(
+ optimizer=dict(
+ type='Adam',
+ lr=0.001,
+ betas=(0.9, 0.999),
+ eps=1e-08,
+ weight_decay=0,
+ amsgrad=False))
+```
+````
+
+### 混合精度训练
+
+如果我们想要使用混合精度训练(Automatic Mixed Precision),只需简单地将 `optim_wrapper` 的类型改为 `AmpOptimWrapper`。
+
+```python
+optim_wrapper = dict(type='AmpOptimWrapper', optimizer=...)
+```
+
+另外,为了方便,我们同时在启动训练脚本 `tools/train.py` 中提供了 `--amp` 参数作为开启混合精度训练的开关,更多细节可以参考[训练教程](../user_guides/train.md)。
+
+### 参数化精细配置
+
+在一些模型中,不同的优化策略需要适应特定的参数,例如不在 BatchNorm 层使用权重衰减,或者在不同层使用不同的学习率等等。
+我们需要用到 `optim_wrapper` 中的 `paramwise_cfg` 参数来进行精细化配置。
+
+- **为不同类型的参数设置超参乘子**
+
+ 例如,我们可以在 `paramwise_cfg` 配置中设置 `norm_decay_mult=0.` 来改变归一化层权重和偏移的衰减为0。
+
+ ```python
+ optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, weight_decay=1e-4),
+ paramwise_cfg=dict(norm_decay_mult=0.))
+ ```
+
+ 支持更多类型的参数配置,参考以下列表:
+
+ - `bias_lr_mult`:偏置的学习率系数(不包括正则化层的偏置以及可变形卷积的 offset),默认值为 1
+ - `bias_decay_mult`:偏置的权值衰减系数(不包括正则化层的偏置以及可变形卷积的 offset),默认值为 1
+ - `norm_decay_mult`:正则化层权重和偏置的权值衰减系数,默认值为 1
+ - `flat_decay_mult`: 一维参数的权值衰减系数,默认值为 1
+ - `dwconv_decay_mult`:Depth-wise 卷积的权值衰减系数,默认值为 1
+ - `bypass_duplicate`:是否跳过重复的参数,默认为 `False`
+ - `dcn_offset_lr_mult`:可变形卷积(Deformable Convolution)的学习率系数,默认值为 1
+
+- **为特定参数设置超参乘子**
+
+ MMPretrain 通过 `paramwise_cfg` 的 `custom_keys` 参数来配置特定参数的超参乘子。
+
+ 例如,我们可以通过以下配置来设置所有 `backbone.layer0` 层的学习率和权重衰减为0, `backbone` 的其余层和优化器保持一致,另外 `head` 层的学习率为0.001.
+
+ ```python
+ optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'backbone.layer0': dict(lr_mult=0, decay_mult=0),
+ 'backbone': dict(lr_mult=1),
+ 'head': dict(lr_mult=0.1)
+ }))
+ ```
+
+### 梯度裁剪
+
+在训练过程中,损失函数可能接近于一些异常陡峭的区域,从而导致梯度爆炸。而梯度裁剪可以帮助稳定训练过程,更多介绍可以参见[该页面](https://paperswithcode.com/method/gradient-clipping)。
+
+目前我们支持在 `optim_wrapper` 字段中添加 `clip_grad` 参数来进行梯度裁剪,更详细的参数可参考 [PyTorch 文档](torch.nn.utils.clip_grad_norm_)。
+
+用例如下:
+
+```python
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
+ # norm_type: 使用的范数类型,此处使用范数2。
+ clip_grad=dict(max_norm=35, norm_type=2))
+```
+
+### 梯度累计
+
+计算资源缺乏时,每个训练批次的大小(batch size)只能设置为较小的值,这可能会影响模型的性能。
+
+可以使用梯度累计来规避这一问题。我们支持在 `optim_wrapper` 字段中添加 `accumulative_counts` 参数来进行梯度累计。
+
+用例如下:
+
+```python
+train_dataloader = dict(batch_size=64)
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
+ accumulative_counts=4)
+```
+
+表示训练时,每累计 4 个 iter 的梯度才执行一次参数更新。由于此时单张 GPU 上的批次大小为 64,也就等价于单张 GPU 上一次参数更新的等效批次大小为 256,也即:
+
+```python
+train_dataloader = dict(batch_size=256)
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001))
+```
+
+## 配置参数优化策略
+
+在训练过程中,优化参数例如学习率、动量,通常不会是固定不变,而是随着训练进程的变化而调整。PyTorch 支持一些学习率调整的调度器,但是不足以完成复杂的策略。在 MMPretrain 中,我们提供 `param_scheduler` 来更好地控制不同优化参数的策略。
+
+### 配置学习率调整策略
+
+深度学习研究中,广泛应用学习率衰减来提高网络的性能。我们支持大多数 PyTorch 学习率调度器, 其中包括 `ExponentialLR`, `LinearLR`, `StepLR`, `MultiStepLR` 等等。
+
+- **单个学习率策略**
+
+ 多数情况下,我们使用单一学习率策略,这里 `param_scheduler` 会是一个字典。比如在默认的 ResNet 网络训练中,我们使用阶梯式的学习率衰减策略 [`MultiStepLR`](mmengine.optim.MultiStepLR),配置文件为:
+
+ ```python
+ param_scheduler = dict(
+ type='MultiStepLR',
+ by_epoch=True,
+ milestones=[100, 150],
+ gamma=0.1)
+ ```
+
+ 或者我们想使用 [`CosineAnnealingLR`](mmengine.optim.CosineAnnealingLR) 来进行学习率衰减:
+
+ ```python
+ param_scheduler = dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ T_max=num_epochs)
+ ```
+
+- **多个学习率策略**
+
+ 然而在一些其他情况下,为了提高模型的精度,通常会使用多种学习率策略。例如,在训练的早期阶段,网络容易不稳定,而学习率的预热就是为了减少这种不稳定性。
+
+ 整个学习过程中,学习率将会通过预热从一个很小的值逐步提高到预定值,再会通过其他的策略进一步调整。
+
+ 在 MMPretrain 中,我们同样使用 `param_scheduler` ,将多种学习策略写成列表就可以完成上述预热策略的组合。
+
+ 例如:
+
+ 1. 在前50次迭代中逐**迭代次数**地**线性**预热
+
+ ```python
+ param_scheduler = [
+ # 逐迭代次数,线性预热
+ dict(type='LinearLR',
+ start_factor=0.001,
+ by_epoch=False, # 逐迭代次数
+ end=50), # 只预热50次迭代次数
+ # 主要的学习率策略
+ dict(type='MultiStepLR',
+ by_epoch=True,
+ milestones=[8, 11],
+ gamma=0.1)
+ ]
+ ```
+
+ 2. 在前10轮迭代中逐**迭代次数**地**线性**预热
+
+ ```python
+ param_scheduler = [
+ # 在前10轮迭代中,逐迭代次数,线性预热
+ dict(type='LinearLR',
+ start_factor=0.001,
+ by_epoch=True,
+ end=10,
+ convert_to_iter_based=True, # 逐迭代次数更新学习率.
+ ),
+ # 在 10 轮次后,通过余弦退火衰减
+ dict(type='CosineAnnealingLR', by_epoch=True, begin=10)
+ ]
+ ```
+
+ 注意这里增加了 `begin` 和 `end` 参数,这两个参数指定了调度器的**生效区间**。生效区间通常只在多个调度器组合时才需要去设置,使用单个调度器时可以忽略。当指定了 `begin` 和 `end` 参数时,表示该调度器只在 [begin, end) 区间内生效,其单位是由 `by_epoch` 参数决定。在组合不同调度器时,各调度器的 `by_epoch` 参数不必相同。如果没有指定的情况下,`begin` 为 0, `end` 为最大迭代轮次或者最大迭代次数。
+
+ 如果相邻两个调度器的生效区间没有紧邻,而是有一段区间没有被覆盖,那么这段区间的学习率维持不变。而如果两个调度器的生效区间发生了重叠,则对多组调度器叠加使用,学习率的调整会按照调度器配置文件中的顺序触发(行为与 PyTorch 中 [`ChainedScheduler`](torch.optim.lr_scheduler.ChainedScheduler) 一致)。
+
+ ```{tip}
+ 为了避免学习率曲线与预期不符, 配置完成后,可以使用 MMPretrain 提供的 [学习率可视化工具](../useful_tools/scheduler_visualization.md) 画出对应学习率调整曲线。
+ ```
+
+### 配置动量调整策略
+
+MMPretrain 支持动量调度器根据学习率修改优化器的动量,从而使损失函数收敛更快。用法和学习率调度器一致。
+
+我们支持的动量策略和详细的使用细节可以参考[这里](https://github.com/open-mmlab/mmengine/blob/main/mmengine/optim/scheduler/momentum_scheduler.py)。动量调度器只是将调度器名称中的 `LR` 替换为了 `Momentum`,动量策略可以直接追加到 `param_scheduler` 列表中。
+
+这里是一个用例:
+
+```python
+param_scheduler = [
+ # 学习率策略
+ dict(type='LinearLR', ...),
+ # 动量策略
+ dict(type='LinearMomentum',
+ start_factor=0.001,
+ by_epoch=False,
+ begin=0,
+ end=1000)
+]
+```
+
+## 新增优化器或者优化器构造器
+
+```{note}
+本部分将修改 MMPretrain 源码或者向 MMPretrain 框架添加代码,初学者可跳过。
+```
+
+### 新增优化器
+
+在学术研究和工业实践中,可能需要使用 MMPretrain 未实现的优化方法,可以通过以下方法添加。
+
+1. 定义一个新的优化器
+
+ 一个自定义的优化器可根据如下规则进行定制:
+
+   假设我们想添加一个名为 `MyOptimizer` 的优化器,其拥有参数 `a`, `b` 和 `c`。
+ 可以创建一个名为 `mmpretrain/engine/optimizer` 的文件夹,并在目录下的一个文件,如 `mmpretrain/engine/optimizer/my_optimizer.py` 中实现该自定义优化器:
+
+ ```python
+ from mmpretrain.registry import OPTIMIZERS
+ from torch.optim import Optimizer
+
+
+ @OPTIMIZERS.register_module()
+ class MyOptimizer(Optimizer):
+
+ def __init__(self, a, b, c):
+ ...
+
+ def step(self, closure=None):
+ ...
+ ```
+
+2. 注册优化器
+
+   要注册上面定义的模块,需要将此模块导入到主命名空间中,下面介绍最常用的方法。
+
+ 修改 `mmpretrain/engine/optimizers/__init__.py`,将其导入至 `mmpretrain.engine` 包。
+
+ ```python
+ # 在 mmpretrain/engine/optimizers/__init__.py 中
+ ...
+ from .my_optimizer import MyOptimizer # MyOptimizer 是我们自定义的优化器的名字
+
+ __all__ = [..., 'MyOptimizer']
+ ```
+
+ 在运行过程中,我们会自动导入 `mmpretrain.engine` 包并同时注册 `MyOptimizer`。
+
+3. 在配置文件中指定优化器
+
+ 之后,用户便可在配置文件的 `optim_wrapper.optimizer` 域中使用 `MyOptimizer`:
+
+ ```python
+ optim_wrapper = dict(
+ optimizer=dict(type='MyOptimizer', a=a_value, b=b_value, c=c_value))
+ ```
+
+### 新增优化器构造器
+
+某些模型可能具有一些特定于参数的设置以进行优化,例如为所有 BatchNorm 层设置不同的权重衰减。
+
+尽管我们已经可以使用 [`optim_wrapper.paramwise_cfg` 字段](#参数化精细配置)来配置特定参数的优化设置,但可能仍然无法覆盖你的需求。
+
+当然你可以在此基础上进行修改。我们默认使用 [`DefaultOptimWrapperConstructor`](mmengine.optim.DefaultOptimWrapperConstructor) 来构造优化器。在构造过程中,通过 `paramwise_cfg` 来精细化配置不同设置。这个默认构造器可以作为新优化器构造器实现的模板。
+
+我们可以新增一个优化器构造器来覆盖这些行为。
+
+```python
+# 在 mmpretrain/engine/optimizers/my_optim_constructor.py 中
+from mmengine.optim import DefaultOptimWrapperConstructor
+from mmpretrain.registry import OPTIM_WRAPPER_CONSTRUCTORS
+
+
+@OPTIM_WRAPPER_CONSTRUCTORS.register_module()
+class MyOptimWrapperConstructor:
+
+ def __init__(self, optim_wrapper_cfg, paramwise_cfg=None):
+ ...
+
+ def __call__(self, model):
+ ...
+```
+
+这是一个已实现的 [OptimWrapperConstructor](mmpretrain.engine.optimizers.LearningRateDecayOptimWrapperConstructor) 具体例子。
+
+接下来类似 [新增优化器教程](#新增优化器) 来导入并使用新的优化器构造器。
+
+1. 修改 `mmpretrain/engine/optimizers/__init__.py`,将其导入至 `mmpretrain.engine` 包。
+
+ ```python
+ # 在 mmpretrain/engine/optimizers/__init__.py 中
+ ...
+ from .my_optim_constructor import MyOptimWrapperConstructor
+
+ __all__ = [..., 'MyOptimWrapperConstructor']
+ ```
+
+2. 在配置文件的 `optim_wrapper.constructor` 字段中使用 `MyOptimWrapperConstructor` 。
+
+ ```python
+ optim_wrapper = dict(
+ constructor=dict(type='MyOptimWrapperConstructor'),
+ optimizer=...,
+ paramwise_cfg=...,
+ )
+ ```
diff --git a/docs/zh_CN/api b/docs/zh_CN/api
new file mode 120000
index 0000000000000000000000000000000000000000..0ef434a4902196a4b89383d9cfb5f47b2e11a999
--- /dev/null
+++ b/docs/zh_CN/api
@@ -0,0 +1 @@
+../en/api
\ No newline at end of file
diff --git a/docs/zh_CN/conf.py b/docs/zh_CN/conf.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c372a8ae590fc889bf775bd041aed2991c15fbb
--- /dev/null
+++ b/docs/zh_CN/conf.py
@@ -0,0 +1,253 @@
+# flake8: noqa
+# Configuration file for the Sphinx documentation builder.
+#
+# This file only contains a selection of the most common options. For a full
+# list see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os
+import subprocess
+import sys
+
+import pytorch_sphinx_theme
+from sphinx.builders.html import StandaloneHTMLBuilder
+
+sys.path.insert(0, os.path.abspath('../../'))
+
+# -- Project information -----------------------------------------------------
+
+project = 'MMPretrain'
+copyright = '2020, OpenMMLab'
+author = 'MMPretrain Authors'
+
+# The full version, including alpha/beta/rc tags
+version_file = '../../mmpretrain/version.py'
+
+
+def get_version():
+ with open(version_file, 'r') as f:
+ exec(compile(f.read(), version_file, 'exec'))
+ return locals()['__version__']
+
+
+release = get_version()
+
+# -- General configuration ---------------------------------------------------
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+ 'sphinx.ext.autodoc',
+ 'sphinx.ext.autosummary',
+ 'sphinx.ext.intersphinx',
+ 'sphinx.ext.napoleon',
+ 'sphinx.ext.viewcode',
+ 'myst_parser',
+ 'sphinx_copybutton',
+ 'sphinx_tabs.tabs',
+ 'notfound.extension',
+ 'sphinxcontrib.jquery',
+]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+#
+source_suffix = {
+ '.rst': 'restructuredtext',
+ '.md': 'markdown',
+}
+
+language = 'zh_CN'
+
+# The master toctree document.
+root_doc = 'index'
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages. See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'pytorch_sphinx_theme'
+html_theme_path = [pytorch_sphinx_theme.get_html_theme_path()]
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further. For a list of options available for each theme, see the
+# documentation.
+# yapf: disable
+html_theme_options = {
+ 'menu': [
+ {
+ 'name': 'GitHub',
+ 'url': 'https://github.com/open-mmlab/mmpretrain'
+ },
+ {
+ 'name': 'Colab 教程',
+ 'children': [
+ {'name': '用命令行工具训练和推理',
+ 'url': 'https://colab.research.google.com/github/mzr1996/mmpretrain-tutorial/blob/master/1.x/MMPretrain_tools.ipynb'},
+ {'name': '用 Python API 训练和推理',
+ 'url': 'https://colab.research.google.com/github/mzr1996/mmpretrain-tutorial/blob/master/1.x/MMPretrain_python.ipynb'},
+ ]
+ },
+ {
+ 'name': 'Version',
+ 'children': [
+ {'name': 'MMPretrain 0.x',
+ 'url': 'https://mmpretrain.readthedocs.io/zh_CN/0.x/',
+ 'description': '0.x branch'},
+ {'name': 'MMPretrain 1.x',
+ 'url': 'https://mmpretrain.readthedocs.io/zh_CN/latest/',
+ 'description': 'Main branch'},
+ ],
+ }
+ ],
+ # Specify the language of shared menu
+ 'menu_lang': 'cn',
+ # Disable the default edit on GitHub
+ 'default_edit_on_github': False,
+}
+# yapf: enable
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+html_css_files = [
+ 'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.css',
+ 'css/readthedocs.css'
+]
+html_js_files = [
+ 'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.js',
+ 'js/custom.js'
+]
+
+# -- Options for HTMLHelp output ---------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'mmpretraindoc'
+
+# -- Options for LaTeX output ------------------------------------------------
+
+latex_elements = {
+ # The paper size ('letterpaper' or 'a4paper').
+ #
+ # 'papersize': 'letterpaper',
+
+ # The font size ('10pt', '11pt' or '12pt').
+ #
+ # 'pointsize': '10pt',
+
+ # Additional stuff for the LaTeX preamble.
+ #
+ # 'preamble': '',
+
+ # Latex figure (float) alignment
+ #
+ # 'figure_align': 'htbp',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+# author, documentclass [howto, manual, or own class]).
+latex_documents = [
+ (root_doc, 'mmpretrain.tex', 'MMPretrain Documentation', author, 'manual'),
+]
+
+# -- Options for manual page output ------------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [(root_doc, 'mmpretrain', 'MMPretrain Documentation', [author], 1)]
+
+# -- Options for Texinfo output ----------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+# dir menu entry, description, category)
+texinfo_documents = [
+ (root_doc, 'mmpretrain', 'MMPretrain Documentation', author, 'mmpretrain',
+ 'OpenMMLab pre-training toolbox and benchmark.', 'Miscellaneous'),
+]
+
+# -- Options for Epub output -------------------------------------------------
+
+# Bibliographic Dublin Core info.
+epub_title = project
+
+# The unique identifier of the text. This can be a ISBN number
+# or the project homepage.
+#
+# epub_identifier = ''
+
+# A unique identification for the text.
+#
+# epub_uid = ''
+
+# A list of files that should not be packed into the epub file.
+epub_exclude_files = ['search.html']
+
+# set priority when building html
+StandaloneHTMLBuilder.supported_image_types = [
+ 'image/svg+xml', 'image/gif', 'image/png', 'image/jpeg'
+]
+
+# -- Extension configuration -------------------------------------------------
+# Ignore >>> when copying code
+copybutton_prompt_text = r'>>> |\.\.\. '
+copybutton_prompt_is_regexp = True
+
+# Auto-generated header anchors
+myst_heading_anchors = 3
+# Enable "colon_fence" extension of myst.
+myst_enable_extensions = ['colon_fence', 'dollarmath']
+
+# Configuration for intersphinx
+intersphinx_mapping = {
+ 'python': ('https://docs.python.org/3', None),
+ 'numpy': ('https://numpy.org/doc/stable', None),
+ 'torch': ('https://pytorch.org/docs/stable/', None),
+ 'mmcv': ('https://mmcv.readthedocs.io/zh_CN/2.x/', None),
+ 'mmengine': ('https://mmengine.readthedocs.io/zh_CN/latest/', None),
+ 'transformers':
+ ('https://huggingface.co/docs/transformers/main/zh/', None),
+}
+napoleon_custom_sections = [
+ # Custom sections for data elements.
+ ('Meta fields', 'params_style'),
+ ('Data fields', 'params_style'),
+]
+
+# Disable docstring inheritance
+autodoc_inherit_docstrings = False
+# Mock some imports when generating the API docs.
+autodoc_mock_imports = ['rich', 'attr', 'einops', 'mat4py']
+# Disable displaying type annotations; they can be very verbose.
+autodoc_typehints = 'none'
+
+# The template used for the 404 (not found) page
+notfound_template = '404.html'
+
+
+def builder_inited_handler(app):
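+ # Run ./stat.py when the builder is initialized, and fail the docs build if the script errors.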
+ if subprocess.run(['./stat.py']).returncode != 0:
+ raise RuntimeError('Failed to run the script `stat.py`.')
+
+
+def setup(app):
+ app.add_config_value('no_underscore_emphasis', False, 'env')
+ app.connect('builder-inited', builder_inited_handler)
diff --git a/docs/zh_CN/device/npu.md b/docs/zh_CN/device/npu.md
new file mode 100644
index 0000000000000000000000000000000000000000..b81c175117be5ce0fc6925ab96d2a5f517b602b4
--- /dev/null
+++ b/docs/zh_CN/device/npu.md
@@ -0,0 +1,41 @@
+# NPU (华为昇腾)
+
+## 使用方法
+
+首先,请参考[此链接](https://mmcv.readthedocs.io/zh_CN/latest/get_started/build.html#npu-mmcv-full)安装带有 NPU 支持的 MMCV,并参考[此链接](https://mmengine.readthedocs.io/en/latest/get_started/installation.html#build-from-source)从源码安装 MMEngine。
+
+使用如下命令,可以利用 8 个 NPU 在机器上训练模型(以 ResNet 为例):
+
+```shell
+bash tools/dist_train.sh configs/resnet/resnet50_8xb32_in1k.py 8
+```
+
+或者,使用如下命令,在一个 NPU 上训练模型(以 ResNet 为例):
+
+```shell
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py
+```
+
+## 经过验证的模型
+
+| Model | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------------------------: | :-------: | :-------: | :----------------------------------------------------------: | :-------------------------------------------------------------: |
+| [ResNet-50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/README.md) | 76.40 | 93.21 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/resnet50_8xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/resnet50_8xb32_in1k.log) |
+| [ResNeXt-50-32x4d](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnext/README.md) | 77.48 | 93.75 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnext/resnext50-32x4d_8xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/resnext50-32x4d_8xb32_in1k.log) |
+| [HRNet-W18](https://github.com/open-mmlab/mmclassification/blob/master/configs/hrnet/README.md) | 77.06 | 93.57 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/hrnet/hrnet-w18_4xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/hrnet-w18_4xb32_in1k.log) |
+| [ResNetV1D-152](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/README.md) | 79.41 | 94.48 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/resnetv1d152_8xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/resnetv1d152_8xb32_in1k.log) |
+| [SE-ResNet-50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/seresnet/README.md) | 77.65 | 93.74 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/seresnet/seresnet50_8xb32_in1k.py) | [model](<>) \|[log](https://download.openmmlab.com/mmclassification/v1/device/npu/seresnet50_8xb32_in1k.log) |
+| [ShuffleNetV2 1.0x](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/shufflenet_v2/README.md) | 69.52 | 88.79 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/shufflenet-v2-1x_16xb64_in1k.log) |
+| [MobileNetV2](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/mobilenet_v2) | 71.74 | 90.28 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/mobilenet-v2_8xb32_in1k.log) |
+| [MobileNetV3-Small](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/mobilenet_v3/README.md) | 67.09 | 87.17 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/mobilenet-v3-small.log) |
+| [\*CSPResNeXt50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/cspnet/README.md) | 77.25 | 93.46 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/cspnet/cspresnext50_8xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/cspresnext50_8xb32_in1k.log) |
+| [\*EfficientNet-B4](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/efficientnet/README.md) | 75.73 | 92.91 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/efficientnet/efficientnet-b4_8xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/efficientnet-b4_8xb32_in1k.log) |
+| [\*\*DenseNet121](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/densenet/README.md) | 72.53 | 90.85 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/densenet/densenet121_4xb256_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/densenet121_4xb256_in1k.log) |
+
+**注意:**
+
+- 如果没有特别标记,NPU 上的结果与使用 FP32 的 GPU 上的结果相同。
+- (\*) 这些模型的训练精度低于相应模型 README 中给出的结果,这主要是因为 README 中的结果直接来自 timm 训练得到的权重,而此处的结果是按照 mmcls 的配置重新训练得到的。使用相同配置在 GPU 上训练得到的结果与 NPU 上的结果一致。
+- (\*\*) 这个模型的精度略低,因为该 config 是 4 卡的配置,而我们使用 8 卡运行;用户可以自行调整超参数(如学习率)以获得最佳精度,可参考下方示例。
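+
+下面是一段利用 `mmengine.Config` 调整学习率的示意代码:这里假设配置中存在 `optim_wrapper.optimizer.lr` 字段,并以学习率线性缩放为例,输出文件名仅为示例,实际字段与数值请以你使用的配置为准:
+
+```python
+from mmengine.config import Config
+
+# 读取上表中 DenseNet121 的 4 卡配置
+cfg = Config.fromfile('configs/densenet/densenet121_4xb256_in1k.py')
+
+# 按线性缩放规则把学习率从 4 卡放大到 8 卡(仅为示意)
+cfg.optim_wrapper.optimizer.lr *= 8 / 4
+
+# 保存修改后的配置,之后即可用 tools/train.py 进行训练
+cfg.dump('densenet121_8xb256_in1k_npu.py')
+```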
+
+**以上所有模型权重及训练日志均由华为昇腾团队提供**
diff --git a/docs/zh_CN/docutils.conf b/docs/zh_CN/docutils.conf
new file mode 100644
index 0000000000000000000000000000000000000000..0c00c84688701117f231fd0c8ec295fb747b7d8f
--- /dev/null
+++ b/docs/zh_CN/docutils.conf
@@ -0,0 +1,2 @@
+[html writers]
+table_style: colwidths-auto
diff --git a/docs/zh_CN/get_started.md b/docs/zh_CN/get_started.md
new file mode 100644
index 0000000000000000000000000000000000000000..0cf252f1f4f2beef0d6f2879f1a166b0dde5ae0c
--- /dev/null
+++ b/docs/zh_CN/get_started.md
@@ -0,0 +1,163 @@
+# 依赖环境
+
+在本节中,我们将演示如何准备 PyTorch 相关的依赖环境。
+
+MMPretrain 适用于 Linux、Windows 和 macOS。它需要 Python 3.7+、CUDA 10.2+ 和 PyTorch 1.8+。
+
+```{note}
+如果你对配置 PyTorch 环境已经很熟悉,并且已经完成了配置,可以直接进入[下一节](#安装)。
+否则的话,请依照以下步骤完成配置。
+```
+
+**第 1 步** 从[官网](https://docs.conda.io/en/latest/miniconda.html)下载并安装 Miniconda。
+
+**第 2 步** 创建一个 conda 虚拟环境并激活它。
+
+```shell
+conda create --name openmmlab python=3.8 -y
+conda activate openmmlab
+```
+
+**第 3 步** 按照[官方指南](https://pytorch.org/get-started/locally/)安装 PyTorch。例如:
+
+在 GPU 平台:
+
+```shell
+conda install pytorch torchvision -c pytorch
+```
+
+```{warning}
+以上命令会自动安装最新版的 PyTorch 与对应的 cudatoolkit,请检查它们是否与你的环境匹配。
+```
+
+在 CPU 平台:
+
+```shell
+conda install pytorch torchvision cpuonly -c pytorch
+```
+
+# 安装
+
+我们推荐用户按照我们的最佳实践来安装 MMPretrain。但除此之外,如果你想根据
+你的习惯完成安装流程,也可以参见[自定义安装](#自定义安装)一节来获取更多信息。
+
+## 最佳实践
+
+根据具体需求,我们支持两种安装模式:
+
+- [从源码安装(推荐)](#从源码安装):希望基于 MMPretrain 框架开发自己的预训练任务,需要添加新的功能,比如新的模型或是数据集,或者使用我们提供的各种工具。
+- [作为 Python 包安装](#作为-python-包安装):只是希望调用 MMPretrain 的 API 接口,或者在自己的项目中导入 MMPretrain 中的模块。
+
+### 从源码安装
+
+这种情况下,从源码按如下方式安装 mmpretrain:
+
+```shell
+git clone https://github.com/open-mmlab/mmpretrain.git
+cd mmpretrain
+pip install -U openmim && mim install -e .
+```
+
+```{note}
+`"-e"` 表示以可编辑形式安装,这样可以在不重新安装的情况下,让本地修改直接生效。
+```
+
+### 作为 Python 包安装
+
+直接使用 mim 安装即可。
+
+```shell
+pip install -U openmim && mim install "mmpretrain>=1.0.0rc8"
+```
+
+```{note}
+`mim` 是一个轻量级的命令行工具,可以根据 PyTorch 和 CUDA 版本为 OpenMMLab 算法库配置合适的环境。同时它也提供了一些对于深度学习实验很有帮助的功能。
+```
+
+## 安装多模态支持 (可选)
+
+MMPretrain 中的多模态模型需要额外的依赖项,要安装这些依赖项,请在安装过程中添加 `[multimodal]` 参数,如下所示:
+
+```shell
+# 从源码安装
+mim install -e ".[multimodal]"
+
+# 作为 Python 包安装
+mim install "mmpretrain[multimodal]>=1.0.0rc8"
+```
+
+## 验证安装
+
+为了验证 MMPretrain 的安装是否正确,我们提供了一些示例代码来执行模型推理。
+
+如果你是**从源码安装**的 mmpretrain,那么直接运行以下命令进行验证:
+
+```shell
+python demo/image_demo.py demo/demo.JPEG resnet18_8xb32_in1k --device cpu
+```
+
+你可以看到命令行中输出了结果字典,包括 `pred_label`,`pred_score` 和 `pred_class` 三个字段。
+
+如果你是**作为 Python 包安装**,那么可以打开你的 Python 解释器,并粘贴如下代码:
+
+```python
+from mmpretrain import get_model, inference_model
+
+model = get_model('resnet18_8xb32_in1k', device='cpu') # 或者 device='cuda:0'
+inference_model(model, 'demo/demo.JPEG')
+```
+
+你会看到输出一个字典,包含预测的标签、得分及类别名。
+
+```{note}
+以上示例中,`resnet18_8xb32_in1k` 是模型名称。你可以使用 [`mmpretrain.list_models`](mmpretrain.apis.list_models) 接口来
+浏览所有的模型,或者在[模型汇总](./modelzoo_statistics.md)页面进行查找。
+```
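+
+例如,下面的代码片段展示了如何用 `list_models` 按名称筛选模型。这里假设 `list_models` 的第一个参数是用于筛选的模式字符串,具体参数请以 API 文档为准:
+
+```python
+from mmpretrain import list_models
+
+# 列出所有名称中包含 "resnet" 的模型,返回一个模型名称列表
+print(list_models('resnet'))
+```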
+
+## 自定义安装
+
+### CUDA 版本
+
+安装 PyTorch 时,需要指定 CUDA 版本。如果您不清楚选择哪个,请遵循我们的建议:
+
+- 对于 Ampere 架构的 NVIDIA GPU,例如 GeForce 30 series 以及 NVIDIA A100,CUDA 11 是必需的。
+- 对于更早的 NVIDIA GPU,CUDA 11 是向后兼容的,但 CUDA 10.2 能够提供更好的兼容性,也更加轻量。
+
+请确保你的 GPU 驱动版本满足最低的版本需求,参阅[这张表](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-major-component-versions__table-cuda-toolkit-driver-versions)。
+
+```{note}
+如果按照我们的最佳实践进行安装,CUDA 运行时库就足够了,因为我们提供相关 CUDA 代码的预编译,你不需要进行本地编译。
+但如果你希望从源码进行 MMCV 的编译,或是进行其他 CUDA 算子的开发,那么就必须安装完整的 CUDA 工具链,参见
+[NVIDIA 官网](https://developer.nvidia.com/cuda-downloads),另外还需要确保该 CUDA 工具链的版本与 PyTorch 安装时
+的配置相匹配(如用 `conda install` 安装 PyTorch 时指定的 cudatoolkit 版本)。
+```
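+
+安装完成后,可以用下面的 Python 片段快速检查当前 PyTorch 所使用的 CUDA 版本,以及 GPU 是否可用(仅作自检示意):
+
+```python
+import torch
+
+# 打印 PyTorch 版本及其编译时所用的 CUDA 版本
+print(torch.__version__)
+print(torch.version.cuda)        # CPU 版本会输出 None
+# 检查当前环境下 GPU 是否可用
+print(torch.cuda.is_available())
+```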
+
+### 在 CPU 环境中安装
+
+MMPretrain 可以仅在 CPU 环境中安装,在 CPU 模式下,你可以完成训练、测试和模型推理等所有操作。
+
+### 在 Google Colab 中安装
+
+参考 [Colab 教程](https://colab.research.google.com/github/mzr1996/mmclassification-tutorial/blob/master/1.x/MMClassification_tools.ipynb) 安装即可。
+
+### 通过 Docker 使用 MMPretrain
+
+MMPretrain 提供 [Dockerfile](https://github.com/open-mmlab/mmpretrain/blob/main/docker/Dockerfile)
+用于构建镜像。请确保你的 [Docker 版本](https://docs.docker.com/engine/install/) >=19.03。
+
+```shell
+# 构建默认的 PyTorch 1.12.1,CUDA 11.3 版本镜像
+# 如果你希望使用其他版本,请修改 Dockerfile
+docker build -t mmpretrain docker/
+```
+
+用以下命令运行 Docker 镜像:
+
+```shell
+docker run --gpus all --shm-size=8g -it -v {DATA_DIR}:/mmpretrain/data mmpretrain
+```
+
+## 故障解决
+
+如果你在安装过程中遇到了什么问题,请先查阅[常见问题](./notes/faq.md)。如果没有找到解决方法,可以在 GitHub
+上[提出 issue](https://github.com/open-mmlab/mmpretrain/issues/new/choose)。
diff --git a/docs/zh_CN/index.rst b/docs/zh_CN/index.rst
new file mode 100644
index 0000000000000000000000000000000000000000..ca57faacfe3e512cd80f35559e692c2a92c1a36c
--- /dev/null
+++ b/docs/zh_CN/index.rst
@@ -0,0 +1,150 @@
+欢迎来到 MMPretrain 中文教程!
+==========================================
+
+MMPretrain 是一个全新升级的预训练开源算法框架,旨在提供各种强大的预训练主干网络,
+并支持了不同的预训练策略。MMPretrain 源自著名的开源项目
+`MMClassification <https://github.com/open-mmlab/mmclassification>`_
+和 `MMSelfSup <https://github.com/open-mmlab/mmselfsup>`_,并开发了许多令人兴奋的新功能。
+目前,预训练阶段对于视觉识别至关重要,凭借丰富而强大的预训练模型,我们能够改进各种下游视觉任务。
+
+我们的代码库旨在成为一个易于使用和用户友好的代码库,并简化学术研究活动和工程任务。
+我们在以下不同部分中详细介绍了 MMPretrain 的特性和设计。
+
+MMPretrain 上手路线
+-------------------------------
+
+为了用户能够快速上手,我们推荐以下流程:
+
+ - 对于想要使用 MMPretrain 的用户,我们推荐先阅读 开始你的第一步_ 部分来设置环境。
+
+ - 对于一些基础使用,我们建议用户阅读 教程_ 来学习如何使用算法库来获得预训练模型以及在下游任务进行评测。
+
+ - 若您想进行算法的自定义,我们提供了 进阶教程_ 来阐述代码修改的方法和规则。
+
+ - 如果您想找到所期望的预训练模型,您可以浏览 模型库_,其中包含了模型库的总结,以及各类主干网络和预训练算法的介绍。
+
+ - 我们同样提供了 分析工具_ 和 可视化_ 来辅助模型分析。
+
+ - 另外,如果您还有其它问题,欢迎查阅 其他说明_,也许可以找到您想要的答案。
+
+我们始终非常欢迎用户的 PRs 和 Issues 来完善 MMPretrain!
+
+.. _开始你的第一步:
+.. toctree::
+ :maxdepth: 1
+ :caption: 开始你的第一步
+
+ get_started.md
+
+.. _教程:
+.. toctree::
+ :maxdepth: 1
+ :caption: 教程
+
+ user_guides/config.md
+ user_guides/dataset_prepare.md
+ user_guides/inference.md
+ user_guides/train.md
+ user_guides/test.md
+ user_guides/downstream.md
+
+.. _进阶教程:
+.. toctree::
+ :maxdepth: 1
+ :caption: 进阶教程
+
+ advanced_guides/datasets.md
+ advanced_guides/pipeline.md
+ advanced_guides/modules.md
+ advanced_guides/schedule.md
+ advanced_guides/runtime.md
+ advanced_guides/evaluation.md
+ advanced_guides/convention.md
+
+.. _模型库:
+.. toctree::
+ :maxdepth: 1
+ :caption: 模型库
+ :glob:
+
+ modelzoo_statistics.md
+ papers/*
+
+.. _可视化:
+.. toctree::
+ :maxdepth: 1
+ :caption: 可视化
+
+ useful_tools/dataset_visualization.md
+ useful_tools/scheduler_visualization.md
+ useful_tools/cam_visualization.md
+ useful_tools/t-sne_visualization.md
+
+.. _分析工具:
+.. toctree::
+ :maxdepth: 1
+ :caption: 分析工具
+
+ useful_tools/print_config.md
+ useful_tools/verify_dataset.md
+ useful_tools/log_result_analysis.md
+ useful_tools/complexity_analysis.md
+ useful_tools/confusion_matrix.md
+ useful_tools/shape_bias.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: 部署
+
+ useful_tools/model_serving.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: 迁移指南
+
+ migration.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: API 参考文档
+
+ mmpretrain.apis
+ mmpretrain.engine
+ mmpretrain.datasets
+ 数据处理
+ mmpretrain.models
+ mmpretrain.structures
+ mmpretrain.visualization
+ mmpretrain.evaluation
+ mmpretrain.utils
+
+.. _其他说明:
+.. toctree::
+ :maxdepth: 1
+ :caption: 其他说明
+
+ notes/contribution_guide.md
+ notes/projects.md
+ notes/changelog.md
+ notes/faq.md
+ notes/pretrain_custom_dataset.md
+ notes/finetune_custom_dataset.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: 设备支持
+
+ device/npu.md
+
+.. toctree::
+ :caption: 切换语言
+
+ English <https://mmpretrain.readthedocs.io/en/latest/>
+ 简体中文 <https://mmpretrain.readthedocs.io/zh_CN/latest/>
+
+
+索引与表格
+==================
+
+* :ref:`genindex`
+* :ref:`search`
diff --git a/docs/zh_CN/locales/zh_CN/LC_MESSAGES/api.po b/docs/zh_CN/locales/zh_CN/LC_MESSAGES/api.po
new file mode 100644
index 0000000000000000000000000000000000000000..abfc40d0c3d1b8da4d49f9ecc28b5c53a9e10f83
--- /dev/null
+++ b/docs/zh_CN/locales/zh_CN/LC_MESSAGES/api.po
@@ -0,0 +1,9090 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2020, OpenMMLab
+# This file is distributed under the same license as the MMClassification
+# package.
+# FIRST AUTHOR , 2021.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: MMClassification\n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-11-22 08:42+0800\n"
+"PO-Revision-Date: 2022-11-22 15:18+0800\n"
+"Last-Translator: Ma Zerun \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.1\n"
+"Language-Team: \n"
+"Language: zh_CN\n"
+"X-Generator: Poedit 2.3\n"
+
+#: ../../api/apis.rst:7 ../../api/apis.rst:14
+msgid "mmcls.apis"
+msgstr ""
+
+#: ../../api/apis.rst:9
+msgid "These are some high-level APIs for classification tasks."
+msgstr "该包提供了一些用于分类任务的高阶 API"
+
+#: ../../api/apis.rst:17
+msgid "Inference"
+msgstr "推理"
+
+#: ../../api/apis.rst:24::1
+msgid ":py:obj:`init_model `"
+msgstr ""
+
+#: ../../api/apis.rst:24::1 mmcls.apis.inference.init_model:1 of
+msgid "Initialize a classifier from config file."
+msgstr "从配置文件初始化一个分类器"
+
+#: ../../api/apis.rst:24::1
+msgid ":py:obj:`inference_model `"
+msgstr ""
+
+#: ../../api/apis.rst:24::1 mmcls.apis.inference.inference_model:1 of
+msgid "Inference image(s) with the classifier."
+msgstr "使用分类器推理图像"
+
+#: ../../api/data_process.rst:5
+msgid "Data Process"
+msgstr "数据处理"
+
+#: ../../api/data_process.rst:7
+msgid ""
+"In MMClassification, the data process and the dataset is decomposed. The datasets only define how to get "
+"samples' basic information from the file system. These basic information includes the ground-truth label "
+"and raw images data / the paths of images.The data process includes data transforms, data preprocessors and "
+"batch augmentations."
+msgstr ""
+"在 MMClassification 中,数据处理和数据集是解耦的。数据集只定义了如何从文件系统中获取样本的基本信息。这些基本"
+"信息包括分类标签和原始图像数据/图像的路径。完整的数据处理流程包括了数据变换(data transform)、数据预处理器"
+"(data preprocessor)及批量数据增强(batch augmentation)。"
+
+#: ../../api/data_process.rst:13
+msgid ""
+":mod:`Data Transforms `: Transforms includes loading, preprocessing, formatting "
+"and etc."
+msgstr ""
+":mod:`数据变换 `:数据变换包括了数据的加载、部分预处理/增强、数据格式化等操作"
+
+#: ../../api/data_process.rst:14
+msgid ""
+":mod:`Data Preprocessors `: Processes includes collate, "
+"normalization, stacking, channel fliping and etc."
+msgstr ""
+":mod:`数据预处理器 `:主要负责批量数据的收集、归一化、堆叠、通道翻转等"
+"操作。"
+
+#: ../../api/data_process.rst:16
+msgid ""
+":mod:`Batch Augmentations `: Batch augmentation involves multiple "
+"samples, such as Mixup and CutMix."
+msgstr ""
+":mod:`批量数据增强 `:批量数据增强是数据预处理器的功能之一,负责处理涉及"
+"多个样本的数据增强操作,例如 Mixup 和 CutMix。"
+
+#: ../../api/data_process.rst:21
+msgid "Data Transforms"
+msgstr "数据变换"
+
+#: ../../api/data_process.rst:23
+msgid ""
+"To prepare the inputs data, we need to do some transforms on these basic information. These transforms "
+"includes loading, preprocessing and formatting. And a series of data transforms makes up a data pipeline. "
+"Therefore, you can find the a ``pipeline`` argument in the configs of dataset, for example:"
+msgstr ""
+"为了准备输入数据,我们需要对数据集中保存的基本信息做一些变换。这些变换包括数据加载、部分预处理和增强、格式"
+"化。一系列的数据变换组成了数据流水线(data pipeline)。因此,在数据集的配置参数中通常存在一个 ``pipeline`` "
+"参数,例如:"
+
+#: ../../api/data_process.rst:46
+msgid ""
+"Every item of a pipeline list is one of the following data transforms class. And if you want to add a "
+"custom data transformation class, the tutorial :doc:`Custom Data Pipelines ` "
+"will help you."
+msgstr ""
+"``pipeline`` 列表中的每一项都是以下数据变换类之一。如果您想添加自定义数据变换类,可以参考 :doc:`自定义数据流"
+"水线教程 `。"
+
+#: ../../api/data_process.rst:54
+msgid "Processing and Augmentation"
+msgstr "组合式增强"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`Albumentations `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.Albumentations:1 of
+msgid "Wrapper to use augmentation from albumentations library."
+msgstr "使用 Albumentations 库进行数据变换的封装类"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`ColorJitter `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.ColorJitter:1 of
+msgid "Randomly change the brightness, contrast and saturation of an image."
+msgstr "随机改变图像的亮度、对比度和饱和度"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`EfficientNetCenterCrop `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.EfficientNetCenterCrop:1
+#: of
+msgid "EfficientNet style center crop."
+msgstr "EfficientNet 风格的中心裁剪"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`EfficientNetRandomCrop `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.EfficientNetRandomCrop:1
+#: of
+msgid "EfficientNet style RandomResizedCrop."
+msgstr "EfficientNet 风格的随机缩放裁剪"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`Lighting `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.Lighting:1 of
+msgid "Adjust images lighting using AlexNet-style PCA jitter."
+msgstr "使用 AlexNet 风格的 PCA 抖动随机调整图像照明"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`RandomCrop `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.RandomCrop:1 of
+msgid "Crop the given Image at a random location."
+msgstr "在随机位置裁剪给定图像"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`RandomErasing `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.RandomErasing:1 of
+msgid "Randomly selects a rectangle region in an image and erase pixels."
+msgstr "在图像中随机选择一个矩形区域并擦除像素"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`RandomResizedCrop `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.RandomResizedCrop:1 of
+msgid "Crop the given image to random scale and aspect ratio."
+msgstr "将给定图像按照随机尺寸和纵横比进行裁剪"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`ResizeEdge `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.ResizeEdge:1 of
+msgid "Resize images along the specified edge."
+msgstr "按照指定边长调整图像尺寸"
+
+#: ../../api/data_process.rst:72
+msgid "Composed Augmentation"
+msgstr "组合式增强"
+
+#: ../../api/data_process.rst:73
+msgid ""
+"Composed augmentation is a kind of methods which compose a series of data augmentation transforms, such as "
+"``AutoAugment`` and ``RandAugment``."
+msgstr ""
+"组合式增强将一系列数据增强方法组合在一起,实现对样本的整体增强,例如 ``AutoAugment`` 和 ``RandAugment``"
+
+#: ../../api/data_process.rst:83::1
+msgid ":py:obj:`AutoAugment `"
+msgstr ""
+
+#: ../../api/data_process.rst:83::1 mmcls.datasets.transforms.auto_augment.AutoAugment:1 of
+msgid "Auto augmentation."
+msgstr ""
+
+#: ../../api/data_process.rst:83::1
+msgid ":py:obj:`RandAugment `"
+msgstr ""
+
+#: ../../api/data_process.rst:83::1 mmcls.datasets.transforms.auto_augment.RandAugment:1 of
+msgid "Random augmentation."
+msgstr ""
+
+#: ../../api/data_process.rst:84
+msgid ""
+"To specify the augmentation combination (The ``policies`` argument), you can use string to specify from "
+"some preset policies."
+msgstr "为了指定增强组合的策略(即上述变换中的 ``policies`` 参数),你可以使用字符串从一系列预设策略中指定。"
+
+#: ../../api/data_process.rst:91
+msgid "Preset policy"
+msgstr "预设策略"
+
+#: ../../api/data_process.rst:92
+msgid "Use for"
+msgstr "用于"
+
+#: ../../api/data_process.rst:93
+msgid "Description"
+msgstr "说明"
+
+#: ../../api/data_process.rst:94
+msgid "\"imagenet\""
+msgstr ""
+
+#: ../../api/data_process.rst:95
+msgid ":class:`AutoAugment`"
+msgstr ""
+
+#: ../../api/data_process.rst:96
+msgid "Policy for ImageNet, come from `DeepVoltaire/AutoAugment`_"
+msgstr "用于 ImageNet 数据集的增强组合,来自 `DeepVoltaire/AutoAugment`_ 仓库"
+
+#: ../../api/data_process.rst:97
+msgid "\"timm_increasing\""
+msgstr ""
+
+#: ../../api/data_process.rst:98
+msgid ":class:`RandAugment`"
+msgstr ""
+
+#: ../../api/data_process.rst:99
+msgid "The ``_RAND_INCREASING_TRANSFORMS`` policy from `timm`_"
+msgstr "`timm`_ 仓库中的 ``_RAND_INCREASING_TRANSFORMS`` 增强组合"
+
+#: ../../api/data_process.rst:104
+msgid "And you can also configure a group of policies manually by selecting from the below table."
+msgstr "你还可以根据下表手动配置一组策略。"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`AutoContrast `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.AutoContrast:1 of
+msgid "Auto adjust image contrast."
+msgstr "自动调整图像对比度"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Brightness `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Brightness:1 of
+msgid "Adjust images brightness."
+msgstr "调整图像亮度"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`ColorTransform `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.ColorTransform:1 of
+msgid "Adjust images color balance."
+msgstr "调整图像色彩平衡"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Contrast `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Contrast:1 of
+msgid "Adjust images contrast."
+msgstr "改变图像对比度"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Cutout `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Cutout:1 of
+msgid "Cutout images."
+msgstr "擦除部分图像区域"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Equalize `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Equalize:1 of
+msgid "Equalize the image histogram."
+msgstr "均衡化图像直方图"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Invert `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Invert:1 of
+msgid "Invert images."
+msgstr "反转图像色阶"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Posterize `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Posterize:1 of
+msgid "Posterize images (reduce the number of bits for each color channel)."
+msgstr "图像像素化(降低各色彩通道的比特数)"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Rotate `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Rotate:1 of
+msgid "Rotate images."
+msgstr "旋转图像"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Sharpness `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Sharpness:1 of
+msgid "Adjust images sharpness."
+msgstr "改变图像锐度"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Shear `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Shear:1 of
+msgid "Shear images."
+msgstr "图像切变"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Solarize `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Solarize:1 of
+msgid "Solarize images (invert all pixel values above a threshold)."
+msgstr "图像日光化(反转高于某一阈值的所有图像色阶)"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`SolarizeAdd `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.SolarizeAdd:1 of
+msgid "SolarizeAdd images (add a certain value to pixels below a threshold)."
+msgstr "图像过曝(为低于某一阈值的所有色阶增加一个固定值)"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Translate `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Translate:1 of
+msgid "Translate images."
+msgstr "平移图像"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`BaseAugTransform `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.BaseAugTransform:1 of
+msgid "The base class of augmentation transform for RandAugment."
+msgstr "用于组合式增强的数据变换基类"
+
+#: ../../api/data_process.rst:128
+msgid "Formatting"
+msgstr "格式化"
+
+#: ../../api/data_process.rst:141::1
+msgid ":py:obj:`Collect `"
+msgstr ""
+
+#: ../../api/data_process.rst:141::1 mmcls.datasets.transforms.formatting.Collect:1 of
+msgid "Collect and only reserve the specified fields."
+msgstr "收集并仅保留指定字段的数据"
+
+#: ../../api/data_process.rst:141::1
+msgid ":py:obj:`PackClsInputs `"
+msgstr ""
+
+#: ../../api/data_process.rst:141::1 mmcls.datasets.transforms.formatting.PackClsInputs:1 of
+msgid "Pack the inputs data for the classification."
+msgstr "将输入数据整理成为用于分类任务的数据格式。"
+
+#: ../../api/data_process.rst:141::1
+msgid ":py:obj:`ToNumpy `"
+msgstr ""
+
+#: ../../api/data_process.rst:141::1 mmcls.datasets.transforms.formatting.ToNumpy:1 of
+msgid "Convert object to :obj:`numpy.ndarray`."
+msgstr "将对象转变为 :obj:`numpy.ndarray`"
+
+#: ../../api/data_process.rst:141::1
+msgid ":py:obj:`ToPIL `"
+msgstr ""
+
+#: ../../api/data_process.rst:141::1 mmcls.datasets.transforms.formatting.ToPIL:1 of
+msgid "Convert the image from OpenCV format to :obj:`PIL.Image.Image`."
+msgstr "将图片从 OpenCV 格式转换为 :obj:`PIL.Image.Image` 格式"
+
+#: ../../api/data_process.rst:141::1
+msgid ":py:obj:`Transpose `"
+msgstr ""
+
+#: ../../api/data_process.rst:141::1 mmcls.datasets.transforms.formatting.Transpose:1 of
+msgid "Transpose numpy array."
+msgstr "转置 NumPy 数组"
+
+#: ../../api/data_process.rst:143
+msgid "MMCV transforms"
+msgstr "MMCV 中的数据变换"
+
+#: ../../api/data_process.rst:145
+msgid ""
+"We also provides many transforms in MMCV. You can use them directly in the config files. Here are some "
+"frequently used transforms, and the whole transforms list can be found in :external+mmcv:doc:`api/"
+"transforms`."
+msgstr ""
+"我们还在 MMCV 中提供了很多数据转换类。你可以在配置文件中直接使用它们。这里我们列举了一些常用的数据变换类,完"
+"整的数据变换类列表可以在 :external+mmcv:doc:`api/transforms` 中找到。"
+
+#: ../../api/data_process.rst:150
+msgid ":external:class:`~mmcv.transforms.LoadImageFromFile`"
+msgstr ""
+
+#: ../../api/data_process.rst:151
+msgid "Load an image from file."
+msgstr "从图片路径加载图片"
+
+#: ../../api/data_process.rst:152
+msgid ":external:class:`~mmcv.transforms.Resize`"
+msgstr ""
+
+#: ../../api/data_process.rst:153
+msgid "Resize images & bbox & seg & keypoints."
+msgstr "缩放图像、bbox、分割图、关键点等"
+
+#: ../../api/data_process.rst:154
+msgid ":external:class:`~mmcv.transforms.RandomResize`"
+msgstr ""
+
+#: ../../api/data_process.rst:155
+msgid "Random resize images & bbox & keypoints."
+msgstr "随机缩放图像、bbox、关键点等"
+
+#: ../../api/data_process.rst:156
+msgid ":external:class:`~mmcv.transforms.RandomFlip`"
+msgstr ""
+
+#: ../../api/data_process.rst:157
+msgid "Flip the image & bbox & keypoints & segmentation map."
+msgstr "随机翻转图像、bbox、关键点等"
+
+#: ../../api/data_process.rst:158
+msgid ":external:class:`~mmcv.transforms.RandomGrayscale`"
+msgstr ""
+
+#: ../../api/data_process.rst:159
+msgid "Randomly convert image to grayscale with a probability."
+msgstr "随机灰度化图像"
+
+#: ../../api/data_process.rst:160
+msgid ":external:class:`~mmcv.transforms.CenterCrop`"
+msgstr ""
+
+#: ../../api/data_process.rst:161
+msgid ""
+"Crop the center of the image, segmentation masks, bounding boxes and key points. If the crop area exceeds "
+"the original image and ``auto_pad`` is True, the original image will be padded before cropping."
+msgstr ""
+"裁剪一张图像的中心区域(同时处理分割图、bbox、关键点等)。如果裁剪尺寸超出原图区域,并且指定了 "
+"``auto_pad=True``,则会在裁剪之前扩充原图至合适大小"
+
+#: ../../api/data_process.rst:162
+msgid ":external:class:`~mmcv.transforms.Normalize`"
+msgstr ""
+
+#: ../../api/data_process.rst:163
+msgid "Normalize the image."
+msgstr "归一化图像"
+
+#: ../../api/data_process.rst:164
+msgid ":external:class:`~mmcv.transforms.Compose`"
+msgstr ""
+
+#: ../../api/data_process.rst:165
+msgid "Compose multiple transforms sequentially."
+msgstr "顺序组合一系列数据变换"
+
+#: ../../api/data_process.rst:170
+msgid "Data Preprocessors"
+msgstr "数据预处理器"
+
+#: ../../api/data_process.rst:172
+msgid ""
+"The data preprocessor is also a component to process the data before feeding data to the neural network. "
+"Comparing with the data transforms, the data preprocessor is a module of the classifier, and it takes a "
+"batch of data to process, which means it can use GPU and batch to accelebrate the processing."
+msgstr ""
+"数据预处理器也是在数据进入神经网络之前,对数据进行处理的组件。与数据变换相比,数据预处理器是模型的一个模"
+"块,并且可以获得一个批次的数据进行处理,这意味着它可以使用模型所在的设备(如 GPU),并利用批量处理,实现加"
+"速。"
+
+#: ../../api/data_process.rst:176
+msgid "The default data preprocessor in MMClassification could do the pre-processing like following:"
+msgstr "MMClassification 中使用的默认的数据预处理器可以进行以下操作:"
+
+#: ../../api/data_process.rst:178
+msgid "Move data to the target device."
+msgstr "将数据移动到模型所在的设备"
+
+#: ../../api/data_process.rst:179
+msgid "Pad inputs to the maximum size of current batch."
+msgstr "将输入填充至当前批次的最大尺寸"
+
+#: ../../api/data_process.rst:180
+msgid "Stack inputs to a batch."
+msgstr "将一系列输入的 tensor 组成 batch"
+
+#: ../../api/data_process.rst:181 mmcls.models.utils.data_preprocessor.ClsDataPreprocessor:16 of
+msgid "Convert inputs from bgr to rgb if the shape of input is (3, H, W)."
+msgstr "如果输入的 tensor 形状为 (3, H, W),则可以执行 BGR 到 RGB 的通道转换"
+
+#: ../../api/data_process.rst:182 mmcls.models.utils.data_preprocessor.ClsDataPreprocessor:17 of
+msgid "Normalize image with defined std and mean."
+msgstr "根据给定的均值和方差对图像进行归一化"
+
+#: ../../api/data_process.rst:183
+msgid "Do batch augmentations like Mixup and CutMix during training."
+msgstr "在训练时进行批量数据增强,如 Mixup 和 CutMix"
+
+#: ../../api/data_process.rst:185
+msgid ""
+"You can configure the data preprocessor by the ``data_preprocessor`` field or ``model.data_preprocessor`` "
+"field in the config file. Typical usages are as below:"
+msgstr ""
+"你可以在配置文件的 ``data_preprocessor`` 字段,或是 ``model.data_preprocessor`` 字段对数据预处理器进行配置。"
+"一个典型的用法如下:"
+
+#: ../../api/data_process.rst:196
+msgid "Or define in ``model.data_preprocessor`` as following:"
+msgstr "或者在 ``model.data_preprocessor`` 字段配置如下:"
+
+#: ../../api/data_process.rst:211
+msgid "Note that the ``model.data_preprocessor`` has higher priority than ``data_preprocessor``."
+msgstr "请注意如果在两处均进行了配置,``model.data_preprocessor`` 拥有更高的优先级。"
+
+#: ../../api/data_process.rst:219::1
+msgid ":py:obj:`ClsDataPreprocessor `"
+msgstr ""
+
+#: ../../api/data_process.rst:219::1 mmcls.models.utils.data_preprocessor.ClsDataPreprocessor:1
+#: of
+msgid "Image pre-processor for classification tasks."
+msgstr "用于分类任务的图像预处理器"
+
+#: ../../api/data_process.rst:223
+msgid "Batch Augmentations"
+msgstr "批量数据增强"
+
+#: ../../api/data_process.rst:225
+msgid ""
+"The batch augmentation is a component of data preprocessors. It involves multiple samples and mix them in "
+"some way, such as Mixup and CutMix."
+msgstr ""
+"批量数据增强是数据预处理器的一个功能。它可以利用一个批次的多个样本,以某种方式进行混合增强,如 Mixup 和 "
+"CutMix。"
+
+#: ../../api/data_process.rst:227
+msgid ""
+"These augmentations are usually only used during training, therefore, we use the ``model.train_cfg`` field "
+"to configure them in config files."
+msgstr "这些数据增强只会在训练过程中生效,因此,我们使用 ``model.train_cfg`` 字段来配置这些功能。"
+
+#: ../../api/data_process.rst:241
+msgid "You can also specify the probabilities of every batch augmentation by the ``probs`` field."
+msgstr "你也可以通过 ``probs`` 字段指定每一个批量数据增强的概率。"
+
+#: ../../api/data_process.rst:255
+msgid "Here is a list of batch augmentations can be used in MMClassification."
+msgstr "这里是 MMClassification 中支持的所有批量数据增强列表。"
+
+#: ../../api/data_process.rst:264::1
+msgid ":py:obj:`Mixup `"
+msgstr ""
+
+#: ../../api/data_process.rst:264::1 mmcls.models.utils.batch_augments.mixup.Mixup:1 of
+msgid "Mixup batch augmentation."
+msgstr ""
+
+#: ../../api/data_process.rst:264::1
+msgid ":py:obj:`CutMix `"
+msgstr ""
+
+#: ../../api/data_process.rst:264::1 mmcls.models.utils.batch_augments.cutmix.CutMix:1 of
+msgid "CutMix batch agumentation."
+msgstr ""
+
+#: ../../api/data_process.rst:264::1
+msgid ":py:obj:`ResizeMix `"
+msgstr ""
+
+#: ../../api/data_process.rst:264::1 mmcls.models.utils.batch_augments.resizemix.ResizeMix:1 of
+msgid "ResizeMix Random Paste layer for a batch of data."
+msgstr ""
+
+#: ../../api/datasets.rst:7 ../../api/datasets.rst:14
+msgid "mmcls.datasets"
+msgstr ""
+
+#: ../../api/datasets.rst:9
+msgid ""
+"The ``datasets`` package contains several usual datasets for image classification tasks and some dataset "
+"wrappers."
+msgstr "``datasets`` 包中包含了分类任务中常用的数据集,以及一些数据集封装。"
+
+#: ../../api/datasets.rst:17
+msgid "Custom Dataset"
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:1 of
+msgid "Custom dataset for classification."
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:3 of
+msgid "The dataset supports two kinds of annotation format."
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:5 of
+msgid "An annotation file is provided, and each line indicates a sample:"
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:7 of
+msgid "The sample files: ::"
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:19 of
+msgid ""
+"The annotation file (the first column is the image path and the second column is the index of category): ::"
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:28 of
+msgid "Please specify the name of categories by the argument ``classes`` or ``metainfo``."
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:31 of
+msgid "The samples are arranged in the specific way: ::"
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:45 of
+msgid ""
+"If the ``ann_file`` is specified, the dataset will be generated by the first way, otherwise, try the second "
+"way."
+msgstr ""
+
+#: mmcls.apis.inference.inference_model mmcls.apis.inference.init_model
+#: mmcls.datasets.base_dataset.BaseDataset mmcls.datasets.cifar.CIFAR10 mmcls.datasets.cifar.CIFAR100
+#: mmcls.datasets.cub.CUB mmcls.datasets.custom.CustomDataset mmcls.datasets.dataset_wrappers.KFoldDataset
+#: mmcls.datasets.imagenet.ImageNet mmcls.datasets.imagenet.ImageNet21k mmcls.datasets.mnist.FashionMNIST
+#: mmcls.datasets.mnist.MNIST mmcls.datasets.multi_label.MultiLabelDataset
+#: mmcls.datasets.transforms.auto_augment.AutoAugment mmcls.datasets.transforms.auto_augment.AutoContrast
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform mmcls.datasets.transforms.auto_augment.Brightness
+#: mmcls.datasets.transforms.auto_augment.ColorTransform mmcls.datasets.transforms.auto_augment.Contrast
+#: mmcls.datasets.transforms.auto_augment.Cutout mmcls.datasets.transforms.auto_augment.Equalize
+#: mmcls.datasets.transforms.auto_augment.Invert mmcls.datasets.transforms.auto_augment.Posterize
+#: mmcls.datasets.transforms.auto_augment.RandAugment mmcls.datasets.transforms.auto_augment.Rotate
+#: mmcls.datasets.transforms.auto_augment.Sharpness mmcls.datasets.transforms.auto_augment.Shear
+#: mmcls.datasets.transforms.auto_augment.Solarize mmcls.datasets.transforms.auto_augment.SolarizeAdd
+#: mmcls.datasets.transforms.auto_augment.Translate mmcls.datasets.transforms.formatting.Collect
+#: mmcls.datasets.transforms.formatting.PackClsInputs mmcls.datasets.transforms.formatting.ToNumpy
+#: mmcls.datasets.transforms.formatting.Transpose mmcls.datasets.transforms.processing.Albumentations
+#: mmcls.datasets.transforms.processing.Albumentations.transform
+#: mmcls.datasets.transforms.processing.ColorJitter mmcls.datasets.transforms.processing.ColorJitter.transform
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop mmcls.datasets.transforms.processing.Lighting
+#: mmcls.datasets.transforms.processing.Lighting.transform mmcls.datasets.transforms.processing.RandomCrop
+#: mmcls.datasets.transforms.processing.RandomCrop.transform
+#: mmcls.datasets.transforms.processing.RandomErasing
+#: mmcls.datasets.transforms.processing.RandomErasing.transform
+#: mmcls.datasets.transforms.processing.RandomResizedCrop
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform
+#: mmcls.datasets.transforms.processing.ResizeEdge mmcls.datasets.transforms.processing.ResizeEdge.transform
+#: mmcls.datasets.voc.VOC mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_test
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_train
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_val
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook.before_train
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_epoch
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_iter
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_test_iter
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_val_iter mmcls.engine.optimizers.lamb.Lamb
+#: mmcls.engine.optimizers.lamb.Lamb.step mmcls.evaluation.metrics.multi_label.AveragePrecision
+#: mmcls.evaluation.metrics.multi_label.MultiLabelMetric mmcls.evaluation.metrics.single_label.Accuracy
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate
+#: mmcls.evaluation.metrics.single_label.Accuracy.compute_metrics
+#: mmcls.evaluation.metrics.single_label.Accuracy.process
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.calculate
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.compute_metrics
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.process
+#: mmcls.evaluation.metrics.voc_multi_label.VOCAveragePrecision
+#: mmcls.evaluation.metrics.voc_multi_label.VOCMultiLabelMetric mmcls.models.backbones.alexnet.AlexNet
+#: mmcls.models.backbones.conformer.Conformer mmcls.models.backbones.convmixer.ConvMixer
+#: mmcls.models.backbones.convnext.ConvNeXt mmcls.models.backbones.cspnet.CSPDarkNet
+#: mmcls.models.backbones.cspnet.CSPNet mmcls.models.backbones.cspnet.CSPResNeXt
+#: mmcls.models.backbones.cspnet.CSPResNet mmcls.models.backbones.davit.DaViT
+#: mmcls.models.backbones.deit.DistilledVisionTransformer mmcls.models.backbones.deit3.DeiT3
+#: mmcls.models.backbones.densenet.DenseNet mmcls.models.backbones.edgenext.EdgeNeXt
+#: mmcls.models.backbones.efficientformer.EfficientFormer mmcls.models.backbones.efficientnet.EfficientNet
+#: mmcls.models.backbones.hornet.HorNet mmcls.models.backbones.hrnet.HRNet
+#: mmcls.models.backbones.inception_v3.InceptionV3 mmcls.models.backbones.lenet.LeNet5
+#: mmcls.models.backbones.mlp_mixer.MlpMixer mmcls.models.backbones.mobilenet_v2.MobileNetV2
+#: mmcls.models.backbones.mobilenet_v2.MobileNetV2.make_layer mmcls.models.backbones.mobilenet_v3.MobileNetV3
+#: mmcls.models.backbones.mobileone.MobileOne mmcls.models.backbones.mobilevit.MobileViT
+#: mmcls.models.backbones.mobilevit.MobileViT.make_mobilenetv2_layer
+#: mmcls.models.backbones.mobilevit.MobileViT.make_mobilevit_layer mmcls.models.backbones.mvit.MViT
+#: mmcls.models.backbones.poolformer.PoolFormer mmcls.models.backbones.regnet.RegNet
+#: mmcls.models.backbones.regnet.RegNet.adjust_width_group
+#: mmcls.models.backbones.regnet.RegNet.generate_regnet
+#: mmcls.models.backbones.regnet.RegNet.get_stages_from_blocks
+#: mmcls.models.backbones.regnet.RegNet.quantize_float mmcls.models.backbones.replknet.RepLKNet
+#: mmcls.models.backbones.repmlp.RepMLPNet mmcls.models.backbones.repvgg.RepVGG
+#: mmcls.models.backbones.res2net.Res2Net mmcls.models.backbones.resnest.ResNeSt
+#: mmcls.models.backbones.resnet.ResNet mmcls.models.backbones.resnet_cifar.ResNet_CIFAR
+#: mmcls.models.backbones.resnext.ResNeXt mmcls.models.backbones.seresnet.SEResNet
+#: mmcls.models.backbones.seresnext.SEResNeXt mmcls.models.backbones.shufflenet_v1.ShuffleNetV1
+#: mmcls.models.backbones.shufflenet_v1.ShuffleNetV1.make_layer
+#: mmcls.models.backbones.shufflenet_v2.ShuffleNetV2 mmcls.models.backbones.swin_transformer.SwinTransformer
+#: mmcls.models.backbones.swin_transformer_v2.SwinTransformerV2 mmcls.models.backbones.t2t_vit.T2T_ViT
+#: mmcls.models.backbones.timm_backbone.TIMMBackbone mmcls.models.backbones.tnt.TNT
+#: mmcls.models.backbones.twins.PCPVT mmcls.models.backbones.twins.SVT mmcls.models.backbones.van.VAN
+#: mmcls.models.backbones.vgg.VGG mmcls.models.backbones.vision_transformer.VisionTransformer
+#: mmcls.models.classifiers.base.BaseClassifier mmcls.models.classifiers.base.BaseClassifier.extract_feat
+#: mmcls.models.classifiers.base.BaseClassifier.extract_feats
+#: mmcls.models.classifiers.base.BaseClassifier.forward
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier.loss
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier.predict
+#: mmcls.models.classifiers.image.ImageClassifier mmcls.models.classifiers.image.ImageClassifier.extract_feat
+#: mmcls.models.classifiers.image.ImageClassifier.forward mmcls.models.classifiers.image.ImageClassifier.loss
+#: mmcls.models.classifiers.image.ImageClassifier.predict mmcls.models.classifiers.timm.TimmClassifier
+#: mmcls.models.classifiers.timm.TimmClassifier.loss mmcls.models.classifiers.timm.TimmClassifier.predict
+#: mmcls.models.heads.cls_head.ClsHead mmcls.models.heads.cls_head.ClsHead.loss
+#: mmcls.models.heads.cls_head.ClsHead.predict mmcls.models.heads.conformer_head.ConformerHead
+#: mmcls.models.heads.conformer_head.ConformerHead.predict mmcls.models.heads.deit_head.DeiTClsHead
+#: mmcls.models.heads.efficientformer_head.EfficientFormerClsHead
+#: mmcls.models.heads.efficientformer_head.EfficientFormerClsHead.loss
+#: mmcls.models.heads.linear_head.LinearClsHead mmcls.models.heads.margin_head.ArcFaceClsHead
+#: mmcls.models.heads.margin_head.ArcFaceClsHead.loss
+#: mmcls.models.heads.margin_head.ArcFaceClsHead.set_margins
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead.loss
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead.predict
+#: mmcls.models.heads.multi_label_csra_head.CSRAClsHead
+#: mmcls.models.heads.multi_label_linear_head.MultiLabelLinearClsHead
+#: mmcls.models.heads.stacked_head.StackedLinearClsHead
+#: mmcls.models.heads.vision_transformer_head.VisionTransformerClsHead
+#: mmcls.models.losses.asymmetric_loss.AsymmetricLoss
+#: mmcls.models.losses.asymmetric_loss.AsymmetricLoss.forward
+#: mmcls.models.losses.cross_entropy_loss.CrossEntropyLoss mmcls.models.losses.focal_loss.FocalLoss
+#: mmcls.models.losses.focal_loss.FocalLoss.forward mmcls.models.losses.label_smooth_loss.LabelSmoothLoss
+#: mmcls.models.losses.label_smooth_loss.LabelSmoothLoss.forward mmcls.models.losses.seesaw_loss.SeesawLoss
+#: mmcls.models.losses.seesaw_loss.SeesawLoss.forward mmcls.models.necks.gap.GlobalAveragePooling
+#: mmcls.models.necks.gem.GeneralizedMeanPooling mmcls.models.necks.hr_fuse.HRFuseScales
+#: mmcls.models.utils.attention.MultiheadAttention mmcls.models.utils.attention.ShiftWindowMSA
+#: mmcls.models.utils.attention.WindowMSA mmcls.models.utils.attention.WindowMSA.forward
+#: mmcls.models.utils.attention.WindowMSAV2 mmcls.models.utils.attention.WindowMSAV2.forward
+#: mmcls.models.utils.batch_augments.cutmix.CutMix
+#: mmcls.models.utils.batch_augments.cutmix.CutMix.cutmix_bbox_and_lam
+#: mmcls.models.utils.batch_augments.cutmix.CutMix.mix
+#: mmcls.models.utils.batch_augments.cutmix.CutMix.rand_bbox
+#: mmcls.models.utils.batch_augments.cutmix.CutMix.rand_bbox_minmax
+#: mmcls.models.utils.batch_augments.mixup.Mixup mmcls.models.utils.batch_augments.mixup.Mixup.mix
+#: mmcls.models.utils.batch_augments.resizemix.ResizeMix
+#: mmcls.models.utils.batch_augments.resizemix.ResizeMix.mix
+#: mmcls.models.utils.channel_shuffle.channel_shuffle mmcls.models.utils.data_preprocessor.ClsDataPreprocessor
+#: mmcls.models.utils.data_preprocessor.ClsDataPreprocessor.forward mmcls.models.utils.embed.HybridEmbed
+#: mmcls.models.utils.embed.PatchEmbed mmcls.models.utils.embed.PatchMerging
+#: mmcls.models.utils.embed.PatchMerging.forward mmcls.models.utils.embed.resize_pos_embed
+#: mmcls.models.utils.embed.resize_relative_position_bias_table mmcls.models.utils.helpers._ntuple
+#: mmcls.models.utils.inverted_residual.InvertedResidual
+#: mmcls.models.utils.inverted_residual.InvertedResidual.forward mmcls.models.utils.layer_scale.LayerScale
+#: mmcls.models.utils.make_divisible.make_divisible
+#: mmcls.models.utils.position_encoding.ConditionalPositionEncoding mmcls.models.utils.se_layer.SELayer
+#: mmcls.utils.setup_env.register_all_modules mmcls.visualization.cls_visualizer.ClsVisualizer of
+msgid "参数"
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:48 mmcls.datasets.imagenet.ImageNet:6
+#: mmcls.datasets.imagenet.ImageNet21k:7 of
+msgid "Annotation file path. Defaults to ''."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:14 mmcls.datasets.custom.CustomDataset:50
+#: mmcls.datasets.imagenet.ImageNet:8 mmcls.datasets.imagenet.ImageNet21k:9
+#: mmcls.datasets.multi_label.MultiLabelDataset:35 of
+msgid "Meta information for dataset, such as class information. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:17 mmcls.datasets.custom.CustomDataset:53
+#: mmcls.datasets.imagenet.ImageNet:11 mmcls.datasets.imagenet.ImageNet21k:12
+#: mmcls.datasets.multi_label.MultiLabelDataset:38 of
+msgid "The root directory for ``data_prefix`` and ``ann_file``. Defaults to ''."
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:56 of
+msgid "Prefix for the data. Defaults to ''."
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:58 of
+msgid ""
+"A sequence of allowed extensions. Defaults to ('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif')."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:37 mmcls.datasets.custom.CustomDataset:61
+#: mmcls.datasets.multi_label.MultiLabelDataset:59 of
+msgid ""
+"Whether to load annotation during instantiation. In some cases, such as visualization, only the meta "
+"information of the dataset is needed, which is not necessary to load annotation file. ``Basedataset`` can "
+"skip load annotations to save time by set ``lazy_init=False``. Defaults to False."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:20 mmcls.datasets.cifar.CIFAR100:17 mmcls.datasets.custom.CustomDataset:67
+#: mmcls.datasets.mnist.FashionMNIST:18 mmcls.datasets.mnist.MNIST:20 mmcls.datasets.voc.VOC:40 of
+msgid "Other keyword arguments in :class:`BaseDataset`."
+msgstr ""
+
+#: ../../api/datasets.rst:22
+msgid "ImageNet"
+msgstr ""
+
+#: mmcls.datasets.imagenet.ImageNet:1 of
+msgid "`ImageNet `_ Dataset."
+msgstr ""
+
+#: mmcls.datasets.imagenet.ImageNet:3 of
+msgid ""
+"The dataset supports two kinds of annotation format. More details can be found in :class:`CustomDataset`."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:20 mmcls.datasets.imagenet.ImageNet:14
+#: mmcls.datasets.imagenet.ImageNet21k:15 mmcls.datasets.multi_label.MultiLabelDataset:41 of
+msgid "Prefix for training data. Defaults to ''."
+msgstr ""
+
+#: mmcls.datasets.imagenet.ImageNet:16 mmcls.datasets.imagenet.ImageNet21k:20 of
+msgid "Other keyword arguments in :class:`CustomDataset` and :class:`BaseDataset`."
+msgstr ""
+
+#: mmcls.datasets.imagenet.ImageNet21k:1 of
+msgid "ImageNet21k Dataset."
+msgstr ""
+
+#: mmcls.datasets.imagenet.ImageNet21k:3 of
+msgid ""
+"Since the dataset ImageNet21k is extremely big, cantains 21k+ classes and 1.4B files. We won't provide the "
+"default categories list. Please specify it from the ``classes`` argument."
+msgstr ""
+
+#: mmcls.datasets.imagenet.ImageNet21k:17 of
+msgid "Not implement by now. Use multi label or not. Defaults to False."
+msgstr ""
+
+#: ../../api/datasets.rst:29
+msgid "CIFAR"
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:1 of
+msgid "`CIFAR10 `_ Dataset."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:3 of
+msgid ""
+"This implementation is modified from https://github.com/pytorch/vision/blob/master/torchvision/datasets/"
+"cifar.py"
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:6 mmcls.datasets.cifar.CIFAR100:3 mmcls.datasets.mnist.FashionMNIST:4
+#: mmcls.datasets.mnist.MNIST:6 of
+msgid "Prefix for data."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:8 mmcls.datasets.cifar.CIFAR100:5 mmcls.datasets.cub.CUB:28
+#: mmcls.datasets.mnist.FashionMNIST:6 mmcls.datasets.mnist.MNIST:8 mmcls.datasets.voc.VOC:34 of
+msgid "``test_mode=True`` means in test phase. It determines to use the training set or test set."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:11 mmcls.datasets.cifar.CIFAR100:8 mmcls.datasets.mnist.FashionMNIST:9
+#: mmcls.datasets.mnist.MNIST:11 mmcls.datasets.voc.VOC:37 of
+msgid "Meta information for dataset, such as categories information. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:14 mmcls.datasets.cifar.CIFAR100:11 mmcls.datasets.mnist.FashionMNIST:12
+#: mmcls.datasets.mnist.MNIST:14 of
+msgid "The root directory for ``data_prefix``. Defaults to ''."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:17 mmcls.datasets.cifar.CIFAR100:14 mmcls.datasets.mnist.FashionMNIST:15
+#: mmcls.datasets.mnist.MNIST:17 of
+msgid "Whether to download the dataset if not exists. Defaults to True."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR100:1 of
+msgid "`CIFAR100 `_ Dataset."
+msgstr ""
+
+#: ../../api/datasets.rst:36
+msgid "MNIST"
+msgstr ""
+
+#: mmcls.datasets.mnist.MNIST:1 of
+msgid "`MNIST `_ Dataset."
+msgstr ""
+
+#: mmcls.datasets.mnist.MNIST:3 of
+msgid ""
+"This implementation is modified from https://github.com/pytorch/vision/blob/master/torchvision/datasets/"
+"mnist.py"
+msgstr ""
+
+#: mmcls.datasets.mnist.FashionMNIST:1 of
+msgid "`Fashion-MNIST `_ Dataset."
+msgstr ""
+
+#: ../../api/datasets.rst:43
+msgid "VOC"
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:1 of
+msgid "`Pascal VOC `_ Dataset."
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:3 of
+msgid "After decompression, the dataset directory structure is as follows:"
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:5 of
+msgid "VOC dataset directory: ::"
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:18 of
+msgid ""
+"Extra difficult label is in VOC annotations, we will use `gt_label_difficult` to record the difficult "
+"labels in each sample and corresponding evaluation should take care of this field to calculate metrics. "
+"Usually, difficult labels are reckoned as negative in defaults."
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:24 of
+msgid "The root directory for VOC dataset."
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:26 of
+msgid ""
+"The path of image set, The file which lists image ids of the sub dataset, and this path is relative to "
+"``data_root``."
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:30 of
+msgid ""
+"Prefix for data and annotation, keyword 'img_path' and 'ann_path' can be set. Defaults to be "
+"``dict(img_path='JPEGImages', ann_path='Annotations')``."
+msgstr ""
+
+#: ../../api/datasets.rst:48
+msgid "CUB"
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:1 of
+msgid "The CUB-200-2011 Dataset."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:3 of
+msgid ""
+"Support the `CUB-200-2011 `_ Dataset. Comparing "
+"with the `CUB-200 `_ Dataset, there are much more "
+"pictures in `CUB-200-2011`. After downloading and decompression, the dataset directory structure is as "
+"follows."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:8 of
+msgid "CUB dataset directory: ::"
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:26 of
+msgid "The root directory for CUB-200-2011 dataset."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:31 of
+msgid "Annotation file path, path relative to ``data_root``. Defaults to 'images.txt'."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:34 of
+msgid "Prefix for iamges, path relative to ``data_root``. Defaults to 'images'."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:37 of
+msgid "The label file, path relative to ``data_root``. Defaults to 'image_class_labels.txt'."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:40 of
+msgid ""
+"The split file to split train and test dataset, path relative to ``data_root``. Defaults to "
+"'train_test_split_file.txt'."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:46 mmcls.datasets.transforms.auto_augment.RandAugment:44
+#: mmcls.evaluation.metrics.multi_label.AveragePrecision:39
+#: mmcls.evaluation.metrics.multi_label.MultiLabelMetric:69 mmcls.evaluation.metrics.single_label.Accuracy:32
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric:68 mmcls.models.backbones.mvit.MViT:80
+#: mmcls.models.backbones.swin_transformer.SwinTransformer:75
+#: mmcls.models.backbones.swin_transformer_v2.SwinTransformerV2:78 mmcls.models.backbones.twins.PCPVT:46
+#: mmcls.models.backbones.twins.SVT:47 mmcls.models.backbones.van.VAN:50
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier:49
+#: mmcls.models.classifiers.image.ImageClassifier.extract_feat:25
+#: mmcls.models.classifiers.timm.TimmClassifier:40 mmcls.structures.cls_data_sample.ClsDataSample:21
+#: mmcls.visualization.cls_visualizer.ClsVisualizer:22 of
+msgid "实际案例"
+msgstr "使用示例"
+
+#: ../../api/datasets.rst:53
+msgid "Base classes"
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:1 of
+msgid "Base dataset for image classification task."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:3 mmcls.datasets.multi_label.MultiLabelDataset:3 of
+msgid "This dataset support annotation file in `OpenMMLab 2.0 style annotation format`."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:9 of
+msgid "Comparing with the :class:`mmengine.BaseDataset`, this class implemented several useful methods."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:12 mmcls.datasets.multi_label.MultiLabelDataset:33 of
+msgid "Annotation file path."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:22 mmcls.datasets.multi_label.MultiLabelDataset:43 of
+msgid "Config for filter data. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:24 of
+msgid ""
+"Support using first few data in annotation file to facilitate training/testing on a smaller dataset. "
+"Defaults to None, which means using all ``data_infos``."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:28 mmcls.datasets.multi_label.MultiLabelDataset:49 of
+msgid ""
+"Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from "
+"master process instead of making a copy. Defaults to True."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:32 of
+msgid "Processing pipeline. Defaults to an empty tuple."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:34 mmcls.datasets.multi_label.MultiLabelDataset:56 of
+msgid "``test_mode=True`` means in test phase. Defaults to False."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:43 mmcls.datasets.multi_label.MultiLabelDataset:65 of
+msgid ""
+"If ``Basedataset.prepare_data`` get a None img. The maximum extra number of cycles to get a valid image. "
+"Defaults to 1000."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:47 mmcls.datasets.multi_label.MultiLabelDataset:69 of
+msgid ""
+"Specify names of classes. - If is string, it should be a file path, and the every line of the file is a "
+"name of a class. - If is a sequence of string, every item is a name of class. - If is None, use categories "
+"information in ``metainfo`` argument, annotation file or the class attribute ``METAINFO``. Defaults to "
+"None."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:47 mmcls.datasets.multi_label.MultiLabelDataset:69 of
+msgid "Specify names of classes."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:49 mmcls.datasets.multi_label.MultiLabelDataset:71 of
+msgid "If is string, it should be a file path, and the every line of the file is a name of a class."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:51 mmcls.datasets.multi_label.MultiLabelDataset:73 of
+msgid "If is a sequence of string, every item is a name of class."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:52 mmcls.datasets.multi_label.MultiLabelDataset:74 of
+msgid ""
+"If is None, use categories information in ``metainfo`` argument, annotation file or the class attribute "
+"``METAINFO``."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:55 mmcls.datasets.multi_label.MultiLabelDataset:77
+#: mmcls.models.backbones.hrnet.HRNet:23 mmcls.models.classifiers.hugging_face.HuggingFaceClassifier:32
+#: mmcls.models.classifiers.image.ImageClassifier:23 mmcls.models.classifiers.timm.TimmClassifier:23 of
+msgid "Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.multi_label.MultiLabelDataset:1 of
+msgid "Multi-label Dataset."
+msgstr ""
+
+#: mmcls.datasets.multi_label.MultiLabelDataset:9 of
+msgid "The annotation format is shown as follows."
+msgstr ""
+
+#: mmcls.datasets.multi_label.MultiLabelDataset:45 of
+msgid ""
+"Support using first few data in annotation file to facilitate training/testing on a smaller dataset. "
+"Defaults to None which means using all ``data_infos``."
+msgstr ""
+
+#: mmcls.datasets.multi_label.MultiLabelDataset:54 of
+msgid "Processing pipeline. Defaults to []."
+msgstr ""
+
+#: ../../api/datasets.rst:60
+msgid "Dataset Wrappers"
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:1 of
+msgid "A wrapper of dataset for K-Fold cross-validation."
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:3 of
+msgid ""
+"K-Fold cross-validation divides all the samples in groups of samples, called folds, of almost equal sizes. "
+"And we use k-1 of folds to do training and use the fold left to do validation."
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:7 of
+msgid "The dataset to be divided"
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:10 of
+msgid "The fold used to do validation. Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:12 of
+msgid "The number of all folds. Defaults to 5."
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:14 of
+msgid "Use the training dataset or validation dataset. Defaults to False."
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:17 of
+msgid "The seed to shuffle the dataset before splitting. If None, not shuffle the dataset. Defaults to None."
+msgstr ""
+
+#: ../../api/datasets.rst:64
+msgid "The dataset wrappers in the MMEngine can be directly used in MMClassification."
+msgstr ""
+
+#: ../../api/datasets.rst:68
+msgid ":class:`~mmengine.dataset.ConcatDataset`"
+msgstr ""
+
+#: ../../api/datasets.rst:69
+msgid "A wrapper of concatenated dataset."
+msgstr ""
+
+#: ../../api/datasets.rst:70
+msgid ":class:`~mmengine.dataset.RepeatDataset`"
+msgstr ""
+
+#: ../../api/datasets.rst:71
+msgid "A wrapper of repeated dataset."
+msgstr ""
+
+#: ../../api/datasets.rst:72
+msgid ":class:`~mmengine.dataset.ClassBalancedDataset`"
+msgstr ""
+
+#: ../../api/datasets.rst:73
+msgid "A wrapper of class balanced dataset."
+msgstr ""
+
+#: ../../api/engine.rst:7 ../../api/engine.rst:19
+msgid "mmcls.engine"
+msgstr ""
+
+#: ../../api/engine.rst:9
+msgid ""
+"This package includes some runtime components, including hooks, runners, optimizers and loops. These "
+"components are useful in classification tasks but not supported by MMEngine yet."
+msgstr ""
+"该包中包含了一些运行时组件,如钩子(hook)、执行器(runner)、优化器(optimizer)和循环执行器(loop)。这些"
+"组件在分类任务中需要用到,而还未被 MMEngine 支持。"
+
+#: ../../api/engine.rst:14
+msgid "Some components may be moved to MMEngine in the future."
+msgstr "部分组件未来可能会被移动到 MMEngine 中。"
+
+#: ../../api/engine.rst:24
+msgid "Hooks"
+msgstr ""
+
+#: ../../api/engine.rst:36::1
+msgid ":py:obj:`ClassNumCheckHook `"
+msgstr ""
+
+#: ../../api/engine.rst:36::1 mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook:1 of
+msgid "Class Number Check HOOK."
+msgstr ""
+
+#: ../../api/engine.rst:36::1
+msgid ":py:obj:`PreciseBNHook `"
+msgstr ""
+
+#: ../../api/engine.rst:36::1 mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:1 of
+msgid "Precise BN hook."
+msgstr ""
+
+#: ../../api/engine.rst:36::1
+msgid ":py:obj:`VisualizationHook `"
+msgstr ""
+
+#: ../../api/engine.rst:36::1
+msgid "Classification Visualization Hook."
+msgstr ""
+
+#: ../../api/engine.rst:36::1
+msgid ":py:obj:`PrepareProtoBeforeValLoopHook `"
+msgstr ""
+
+#: ../../api/engine.rst:36::1 mmcls.engine.hooks.retriever_hooks.PrepareProtoBeforeValLoopHook:1
+#: of
+msgid "The hook to prepare the prototype in retrievers."
+msgstr ""
+
+#: ../../api/engine.rst:36::1
+msgid ":py:obj:`SetAdaptiveMarginsHook `"
+msgstr ""
+
+#: ../../api/engine.rst:36::1 mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook:1 of
+msgid "Set adaptive-margins in ArcFaceClsHead based on the power of category-wise count."
+msgstr ""
+
+#: ../../api/engine.rst:40
+msgid "Optimizers"
+msgstr ""
+
+#: ../../api/engine.rst:47::1
+msgid ":py:obj:`Lamb `"
+msgstr ""
+
+#: ../../api/engine.rst:47::1 mmcls.engine.optimizers.lamb.Lamb:1 of
+msgid "A pure pytorch variant of FuseLAMB (NvLamb variant) optimizer."
+msgstr ""
+
+#: ../../api/evaluation.rst:7 ../../api/evaluation.rst:14
+msgid "mmcls.evaluation"
+msgstr ""
+
+#: ../../api/evaluation.rst:9
+msgid "This package includes metrics and evaluators for classification tasks."
+msgstr "该包中包含了用于分类任务的一系列评测指标及评测器。"
+
+#: ../../api/evaluation.rst:17
+msgid "Single Label Metric"
+msgstr ""
+
+#: ../../api/evaluation.rst:26::1
+msgid ":py:obj:`Accuracy `"
+msgstr ""
+
+#: ../../api/evaluation.rst:26::1 mmcls.evaluation.metrics.single_label.Accuracy:1 of
+msgid "Accuracy evaluation metric."
+msgstr ""
+
+#: ../../api/evaluation.rst:26::1
+msgid ":py:obj:`SingleLabelMetric `"
+msgstr ""
+
+#: ../../api/evaluation.rst:26::1 mmcls.evaluation.metrics.single_label.SingleLabelMetric:1 of
+msgid "A collection of precision, recall, f1-score and support for single-label tasks."
+msgstr ""
+
+#: ../../api/evaluation.rst:28
+msgid "Multi Label Metric"
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1
+msgid ":py:obj:`AveragePrecision `"
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1 mmcls.evaluation.metrics.multi_label.AveragePrecision:1 of
+msgid "Calculate the average precision with respect of classes."
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1
+msgid ":py:obj:`MultiLabelMetric `"
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1 mmcls.evaluation.metrics.multi_label.MultiLabelMetric:1 of
+msgid "A collection of precision, recall, f1-score and support for multi-label tasks."
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1
+msgid ":py:obj:`VOCAveragePrecision `"
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1 mmcls.evaluation.metrics.voc_multi_label.VOCAveragePrecision:1
+#: of
+msgid "Calculate the average precision with respect of classes for VOC dataset."
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1
+msgid ":py:obj:`VOCMultiLabelMetric `"
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1 mmcls.evaluation.metrics.voc_multi_label.VOCMultiLabelMetric:1
+#: of
+msgid ""
+"A collection of metrics for multi-label multi-class classification task based on confusion matrix for VOC "
+"dataset."
+msgstr ""
+
+#: ../../api/generated/mmcls.apis.inference_model.rst:2
+msgid "mmcls.apis.inference\\_model"
+msgstr ""
+
+#: mmcls.apis.inference.inference_model:3 of
+msgid "The loaded classifier."
+msgstr ""
+
+#: mmcls.apis.inference.inference_model:5 of
+msgid "The image filename or loaded image."
+msgstr ""
+
+#: mmcls.apis.inference.inference_model mmcls.apis.inference.init_model
+#: mmcls.datasets.transforms.processing.Albumentations.albu_builder
+#: mmcls.datasets.transforms.processing.Albumentations.mapper
+#: mmcls.datasets.transforms.processing.Albumentations.transform
+#: mmcls.datasets.transforms.processing.ColorJitter.transform
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform
+#: mmcls.datasets.transforms.processing.Lighting.transform
+#: mmcls.datasets.transforms.processing.RandomCrop.transform
+#: mmcls.datasets.transforms.processing.RandomErasing.transform
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform
+#: mmcls.datasets.transforms.processing.ResizeEdge.transform
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate
+#: mmcls.evaluation.metrics.single_label.Accuracy.compute_metrics
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.calculate
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.compute_metrics
+#: mmcls.models.backbones.regnet.RegNet.adjust_width_group
+#: mmcls.models.backbones.regnet.RegNet.generate_regnet
+#: mmcls.models.backbones.regnet.RegNet.get_stages_from_blocks
+#: mmcls.models.backbones.regnet.RegNet.quantize_float
+#: mmcls.models.classifiers.base.BaseClassifier.extract_feats
+#: mmcls.models.classifiers.base.BaseClassifier.forward
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier.loss
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier.predict
+#: mmcls.models.classifiers.image.ImageClassifier.extract_feat
+#: mmcls.models.classifiers.image.ImageClassifier.forward mmcls.models.classifiers.image.ImageClassifier.loss
+#: mmcls.models.classifiers.timm.TimmClassifier.loss mmcls.models.classifiers.timm.TimmClassifier.predict
+#: mmcls.models.heads.cls_head.ClsHead.loss mmcls.models.heads.cls_head.ClsHead.predict
+#: mmcls.models.heads.conformer_head.ConformerHead.predict
+#: mmcls.models.heads.efficientformer_head.EfficientFormerClsHead.loss
+#: mmcls.models.heads.margin_head.ArcFaceClsHead.loss
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead.loss
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead.predict
+#: mmcls.models.losses.asymmetric_loss.AsymmetricLoss.forward mmcls.models.losses.focal_loss.FocalLoss.forward
+#: mmcls.models.losses.label_smooth_loss.LabelSmoothLoss.forward
+#: mmcls.models.losses.seesaw_loss.SeesawLoss.forward mmcls.models.utils.batch_augments.cutmix.CutMix.mix
+#: mmcls.models.utils.batch_augments.mixup.Mixup.mix mmcls.models.utils.batch_augments.resizemix.ResizeMix.mix
+#: mmcls.models.utils.channel_shuffle.channel_shuffle
+#: mmcls.models.utils.data_preprocessor.ClsDataPreprocessor.forward
+#: mmcls.models.utils.embed.PatchMerging.forward mmcls.models.utils.embed.resize_pos_embed
+#: mmcls.models.utils.embed.resize_relative_position_bias_table
+#: mmcls.models.utils.inverted_residual.InvertedResidual.forward
+#: mmcls.models.utils.make_divisible.make_divisible of
+msgid "返回"
+msgstr ""
+
+#: mmcls.apis.inference.inference_model:8 of
+msgid "The classification results that contains `class_name`, `pred_label` and `pred_score`."
+msgstr ""
+
+#: mmcls.apis.inference.inference_model:10 of
+msgid "The classification results that contains"
+msgstr ""
+
+#: mmcls.apis.inference.inference_model:11 of
+msgid "`class_name`, `pred_label` and `pred_score`."
+msgstr ""
+
+#: mmcls.apis.inference.inference_model mmcls.apis.inference.init_model
+#: mmcls.datasets.transforms.processing.Albumentations.albu_builder
+#: mmcls.datasets.transforms.processing.Albumentations.mapper
+#: mmcls.datasets.transforms.processing.Albumentations.transform
+#: mmcls.datasets.transforms.processing.ColorJitter.transform
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform
+#: mmcls.datasets.transforms.processing.Lighting.transform
+#: mmcls.datasets.transforms.processing.RandomCrop.transform
+#: mmcls.datasets.transforms.processing.RandomErasing.transform
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform
+#: mmcls.datasets.transforms.processing.ResizeEdge.transform
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate
+#: mmcls.evaluation.metrics.single_label.Accuracy.compute_metrics
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.calculate
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.compute_metrics
+#: mmcls.models.backbones.regnet.RegNet.adjust_width_group
+#: mmcls.models.backbones.regnet.RegNet.generate_regnet
+#: mmcls.models.backbones.regnet.RegNet.get_stages_from_blocks
+#: mmcls.models.backbones.regnet.RegNet.quantize_float
+#: mmcls.models.classifiers.base.BaseClassifier.extract_feats
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier.loss
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier.predict
+#: mmcls.models.classifiers.image.ImageClassifier.extract_feat
+#: mmcls.models.classifiers.image.ImageClassifier.loss mmcls.models.classifiers.timm.TimmClassifier.loss
+#: mmcls.models.classifiers.timm.TimmClassifier.predict mmcls.models.heads.cls_head.ClsHead.loss
+#: mmcls.models.heads.cls_head.ClsHead.predict mmcls.models.heads.conformer_head.ConformerHead.predict
+#: mmcls.models.heads.efficientformer_head.EfficientFormerClsHead.loss
+#: mmcls.models.heads.margin_head.ArcFaceClsHead.loss
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead.loss
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead.predict
+#: mmcls.models.losses.asymmetric_loss.AsymmetricLoss.forward mmcls.models.losses.focal_loss.FocalLoss.forward
+#: mmcls.models.losses.label_smooth_loss.LabelSmoothLoss.forward
+#: mmcls.models.losses.seesaw_loss.SeesawLoss.forward mmcls.models.utils.batch_augments.cutmix.CutMix.mix
+#: mmcls.models.utils.batch_augments.mixup.Mixup.mix mmcls.models.utils.batch_augments.resizemix.ResizeMix.mix
+#: mmcls.models.utils.channel_shuffle.channel_shuffle
+#: mmcls.models.utils.data_preprocessor.ClsDataPreprocessor.forward
+#: mmcls.models.utils.embed.PatchMerging.forward mmcls.models.utils.embed.resize_pos_embed
+#: mmcls.models.utils.embed.resize_relative_position_bias_table
+#: mmcls.models.utils.inverted_residual.InvertedResidual.forward
+#: mmcls.models.utils.make_divisible.make_divisible of
+msgid "返回类型"
+msgstr ""
+
+#: ../../api/generated/mmcls.apis.init_model.rst:2
+msgid "mmcls.apis.init\\_model"
+msgstr ""
+
+#: mmcls.apis.inference.init_model:3 of
+msgid "Config file path or the config object."
+msgstr ""
+
+#: mmcls.apis.inference.init_model:6 of
+msgid "Checkpoint path. If left as None, the model will not load any weights."
+msgstr ""
+
+#: mmcls.apis.inference.init_model:9 of
+msgid "Options to override some settings in the used config."
+msgstr ""
+
+#: mmcls.apis.inference.init_model:12 of
+msgid "The constructed classifier."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Albumentations.rst:7
+msgid "Albumentations"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Collect:3 mmcls.datasets.transforms.formatting.PackClsInputs:3
+#: mmcls.datasets.transforms.formatting.ToNumpy:3 mmcls.datasets.transforms.formatting.ToPIL:3
+#: mmcls.datasets.transforms.formatting.Transpose:3 mmcls.datasets.transforms.processing.Albumentations:3
+#: mmcls.datasets.transforms.processing.ColorJitter:7
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:3
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:3
+#: mmcls.datasets.transforms.processing.Lighting:3 mmcls.datasets.transforms.processing.RandomCrop:3
+#: mmcls.datasets.transforms.processing.RandomErasing:3
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:7 mmcls.datasets.transforms.processing.ResizeEdge:3
+#: of
+msgid "**Required Keys:**"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:5 mmcls.datasets.transforms.formatting.ToPIL:5
+#: mmcls.datasets.transforms.formatting.ToPIL:9 mmcls.datasets.transforms.processing.Albumentations:5
+#: mmcls.datasets.transforms.processing.Albumentations:9 mmcls.datasets.transforms.processing.ColorJitter:9
+#: mmcls.datasets.transforms.processing.ColorJitter:13
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:5
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:9
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:5
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:9
+#: mmcls.datasets.transforms.processing.Lighting:5 mmcls.datasets.transforms.processing.Lighting:9
+#: mmcls.datasets.transforms.processing.RandomCrop:5 mmcls.datasets.transforms.processing.RandomCrop:9
+#: mmcls.datasets.transforms.processing.RandomErasing:5 mmcls.datasets.transforms.processing.RandomErasing:9
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:9
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:13 mmcls.datasets.transforms.processing.ResizeEdge:5
+#: mmcls.datasets.transforms.processing.ResizeEdge:9 of
+msgid "img"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.ToNumpy:7 mmcls.datasets.transforms.formatting.ToPIL:7
+#: mmcls.datasets.transforms.formatting.Transpose:7 mmcls.datasets.transforms.processing.Albumentations:7
+#: mmcls.datasets.transforms.processing.ColorJitter:11
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:7
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:7
+#: mmcls.datasets.transforms.processing.Lighting:7 mmcls.datasets.transforms.processing.RandomCrop:7
+#: mmcls.datasets.transforms.processing.RandomErasing:7
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:11 mmcls.datasets.transforms.processing.ResizeEdge:7
+#: of
+msgid "**Modified Keys:**"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations:10
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:10
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:10
+#: mmcls.datasets.transforms.processing.RandomCrop:10
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:14
+#: mmcls.datasets.transforms.processing.ResizeEdge:10 of
+msgid "img_shape"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations:12 of
+msgid ""
+"Adds custom transformations from albumentations library. More details can be found in `Albumentations "
+"`_. An example of ``transforms`` is as followed:"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations:42 of
+msgid "List of albumentations transform configs."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations:44 of
+msgid ""
+"Mapping of mmcls to albumentations fields, in format {'input key':'albumentation-style key'}. Defaults to "
+"None."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations:50 mmcls.models.backbones.cspnet.CSPDarkNet:30
+#: mmcls.models.backbones.cspnet.CSPNet:63 mmcls.models.backbones.cspnet.CSPResNeXt:28
+#: mmcls.models.backbones.cspnet.CSPResNet:28 mmcls.models.backbones.efficientformer.EfficientFormer:53
+#: mmcls.models.backbones.hrnet.HRNet:52 mmcls.models.backbones.inception_v3.InceptionV3:23
+#: mmcls.models.backbones.mobileone.MobileOne:48 mmcls.models.backbones.regnet.RegNet:45
+#: mmcls.models.backbones.res2net.Res2Net:56 mmcls.models.backbones.resnet.ResNet:54
+#: mmcls.models.backbones.seresnet.SEResNet:56 mmcls.models.heads.margin_head.ArcFaceClsHead:9 of
+msgid "示例"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.albu_builder:1 of
+msgid "Import a module from albumentations."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.albu_builder:3 of
+msgid ""
+"It inherits some of :func:`build_from_cfg` logic. :param cfg: Config dict. It should at least contain the "
+"key \"type\". :type cfg: dict"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.albu_builder:7 of
+msgid "The constructed object."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.mapper:1 of
+msgid "Dictionary mapper."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.mapper:3 of
+msgid ""
+"Renames keys according to keymap provided. :param d: old dict :type d: dict :param keymap: "
+"{'old_key':'new_key'} :type keymap: dict"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.mapper:9 of
+msgid "new dict."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.transform:1 of
+msgid "Transform function to perform albumentations transforms."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.transform:3
+#: mmcls.datasets.transforms.processing.ColorJitter.transform:3
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform:3
+#: mmcls.datasets.transforms.processing.Lighting.transform:3
+#: mmcls.datasets.transforms.processing.RandomCrop.transform:3
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform:3
+#: mmcls.datasets.transforms.processing.ResizeEdge.transform:3 of
+msgid "Result dict from loading pipeline."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.transform:6 of
+msgid "Transformed results, 'img' and 'img_shape' keys are updated in result dict."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.transform:8 of
+msgid "Transformed results, 'img' and 'img_shape' keys are"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.transform:9 of
+msgid "updated in result dict."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.AutoAugment.rst:7
+msgid "AutoAugment"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.AutoAugment:3 of
+msgid ""
+"This data augmentation is proposed in `AutoAugment: Learning Augmentation Policies from Data `_."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.AutoAugment:6 of
+msgid ""
+"The policies of auto augmentation. If string, use preset policies collection like \"imagenet\". If list, "
+"Each item is a sub policies, composed by several augmentation policy dicts. When AutoAugment is called, a "
+"random sub policies in ``policies`` will be selected to augment images."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.AutoAugment:12 mmcls.datasets.transforms.auto_augment.RandAugment:38
+#: of
+msgid ""
+"Configs of hyperparameters. Hyperparameters will be used in policies that require these arguments if these "
+"arguments are not set in policy dicts. Defaults to ``dict(pad_val=128)``."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.AutoContrast.rst:7
+msgid "AutoContrast"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.AutoContrast:3 of
+msgid "The probability for performing auto contrast therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.AutoContrast:6 mmcls.datasets.transforms.auto_augment.Brightness:15
+#: mmcls.datasets.transforms.auto_augment.ColorTransform:15 mmcls.datasets.transforms.auto_augment.Cutout:15
+#: mmcls.datasets.transforms.auto_augment.Equalize:6 mmcls.datasets.transforms.auto_augment.Invert:6
+#: mmcls.datasets.transforms.auto_augment.Posterize:11 mmcls.datasets.transforms.auto_augment.Rotate:27
+#: mmcls.datasets.transforms.auto_augment.Sharpness:15 mmcls.datasets.transforms.auto_augment.Shear:23
+#: mmcls.datasets.transforms.auto_augment.Solarize:10 mmcls.datasets.transforms.auto_augment.SolarizeAdd:13
+#: mmcls.datasets.transforms.auto_augment.Translate:25 of
+msgid "Other keyword arguments of :class:`BaseAugTransform`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.AutoContrast.transform:1
+#: mmcls.datasets.transforms.auto_augment.Brightness.transform:1
+#: mmcls.datasets.transforms.auto_augment.ColorTransform.transform:1
+#: mmcls.datasets.transforms.auto_augment.Contrast.transform:1
+#: mmcls.datasets.transforms.auto_augment.Cutout.transform:1
+#: mmcls.datasets.transforms.auto_augment.Equalize.transform:1
+#: mmcls.datasets.transforms.auto_augment.Invert.transform:1
+#: mmcls.datasets.transforms.auto_augment.Posterize.transform:1
+#: mmcls.datasets.transforms.auto_augment.Rotate.transform:1
+#: mmcls.datasets.transforms.auto_augment.Sharpness.transform:1
+#: mmcls.datasets.transforms.auto_augment.Shear.transform:1
+#: mmcls.datasets.transforms.auto_augment.Solarize.transform:1
+#: mmcls.datasets.transforms.auto_augment.SolarizeAdd.transform:1
+#: mmcls.datasets.transforms.auto_augment.Translate.transform:1 of
+msgid "Apply transform to results."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.BaseAugTransform.rst:7
+msgid "BaseAugTransform"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:3 of
+msgid ""
+"This class provides several common attributions and methods to support the magnitude level mapping and "
+"magnitude level randomness in :class:`RandAugment`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:7 of
+msgid "Magnitude level."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:9 of
+msgid ""
+"For augmentation have magnitude argument, maybe \"magnitude\", \"angle\" or other, you can specify the "
+"magnitude level mapping range to generate the magnitude argument. For example, assume ``total_level`` is "
+"10, ``magnitude_level=3`` specify magnitude is 3 if ``magnitude_range=(0, 10)`` while specify magnitude is "
+"7 if ``magnitude_range=(10, 0)``. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:17 of
+msgid ""
+"Deviation of magnitude noise applied. - If positive number, the magnitude obeys normal distribution :"
+"math:`\\mathcal{N}(magnitude, magnitude_std)`. - If 0 or negative number, magnitude remains unchanged. - If "
+"str \"inf\", the magnitude obeys uniform distribution :math:`Uniform(min, magnitude)`. Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:17
+#: mmcls.datasets.transforms.auto_augment.RandAugment:27 of
+msgid "Deviation of magnitude noise applied."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:19 of
+msgid ""
+"If positive number, the magnitude obeys normal distribution :math:`\\mathcal{N}(magnitude, magnitude_std)`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:21
+#: mmcls.datasets.transforms.auto_augment.RandAugment:31 of
+msgid "If 0 or negative number, magnitude remains unchanged."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:22
+#: mmcls.datasets.transforms.auto_augment.RandAugment:32 of
+msgid "If str \"inf\", the magnitude obeys uniform distribution :math:`Uniform(min, magnitude)`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:25 of
+msgid "Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:27
+#: mmcls.datasets.transforms.auto_augment.RandAugment:35 of
+msgid "Total level for the magnitude. Defaults to 10."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:30 of
+msgid "The probability for performing transformation therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:33 of
+msgid "The probability that turns the magnitude negative, which should be in range [0,1]. Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform.extra_repr:1 of
+msgid "Extra repr string when auto-generating magnitude is enabled."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Brightness.rst:7
+msgid "Brightness"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Brightness:3 of
+msgid ""
+"The magnitude used for adjusting brightness. A positive magnitude would enhance the brightness and a "
+"negative magnitude would make the image darker. A magnitude=0 gives the origin img. If None, generate from "
+"``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Brightness:9 of
+msgid ""
+"The probability for performing brightness adjusting therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Brightness:12
+#: mmcls.datasets.transforms.auto_augment.ColorTransform:12 mmcls.datasets.transforms.auto_augment.Contrast:12
+#: mmcls.datasets.transforms.auto_augment.Sharpness:12 mmcls.datasets.transforms.auto_augment.Shear:17
+#: mmcls.datasets.transforms.auto_augment.Translate:19 of
+msgid "The probability that turns the magnitude negative, which should be in range [0,1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Collect.rst:7
+msgid "Collect"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Collect:5 mmcls.datasets.transforms.formatting.Transpose:5
+#: mmcls.datasets.transforms.formatting.Transpose:9 of
+msgid "``*keys``"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Collect:7 mmcls.datasets.transforms.formatting.PackClsInputs:9 of
+msgid "**Deleted Keys:**"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Collect:9 of
+msgid "All keys except those in the argument ``*keys``."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Collect:11 of
+msgid "The keys of the fields to be collected."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.ColorJitter.rst:7
+msgid "ColorJitter"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter:3 of
+msgid ""
+"Modified from https://github.com/pytorch/vision/blob/main/torchvision/transforms/transforms.py Licensed "
+"under the BSD 3-Clause License."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter:15 of
+msgid ""
+"How much to jitter brightness. brightness_factor is chosen uniformly from ``[max(0, 1 - brightness), 1 + "
+"brightness]`` or the given ``[min, max]``. Should be non negative numbers. Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter:20 of
+msgid ""
+"How much to jitter contrast. contrast_factor is chosen uniformly from ``[max(0, 1 - contrast), 1 + "
+"contrast]`` or the given ``[min, max]``. Should be non negative numbers. Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter:25 of
+msgid ""
+"How much to jitter saturation. saturation_factor is chosen uniformly from ``[max(0, 1 - saturation), 1 + "
+"saturation]`` or the given ``[min, max]``. Should be non negative numbers. Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter:30 of
+msgid ""
+"How much to jitter hue. hue_factor is chosen uniformly from ``[-hue, hue]`` (0 <= hue <= 0.5) or the given "
+"``[min, max]`` (-0.5 <= min <= max <= 0.5). Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter.transform:1
+#: mmcls.datasets.transforms.processing.Lighting.transform:1
+#: mmcls.datasets.transforms.processing.ResizeEdge.transform:1 of
+msgid "Transform function to resize images."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter.transform:6 of
+msgid "ColorJitter results, 'img' key is updated in result dict."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.ColorTransform.rst:7
+msgid "ColorTransform"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.ColorTransform:3 of
+msgid ""
+"The magnitude used for color transform. A positive magnitude would enhance the color and a negative "
+"magnitude would make the image grayer. A magnitude=0 gives the origin img. If None, generate from "
+"``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.ColorTransform:9 of
+msgid "The probability for performing ColorTransform therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Contrast.rst:7
+msgid "Contrast"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Contrast:3 of
+msgid ""
+"The magnitude used for adjusting contrast. A positive magnitude would enhance the contrast and a negative "
+"magnitude would make the image grayer. A magnitude=0 gives the origin img. If None, generate from "
+"``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Contrast:9 of
+msgid ""
+"The probability for performing contrast adjusting therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Cutout.rst:7
+msgid "Cutout"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Cutout:3 of
+msgid ""
+"Expected cutout shape (h, w). If given as a single value, the value will be used for both h and w. If None, "
+"generate from ``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Cutout:8 of
+msgid ""
+"Pixel pad_val value for constant fill. If it is a sequence, it must have the same length with the image "
+"channels. Defaults to 128."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Cutout:12 of
+msgid "The probability for performing cutout therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.EfficientNetCenterCrop.rst:7
+msgid "EfficientNetCenterCrop"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:12 of
+msgid "Expected size after cropping with the format of (h, w)."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:15
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:18 of
+msgid "The crop padding parameter in efficientnet style center crop. Defaults to 32."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:18 of
+msgid ""
+"Interpolation method, accepted values are 'nearest', 'bilinear', 'bicubic', 'area', 'lanczos'. Only valid "
+"if ``efficientnet_style`` is True. Defaults to 'bicubic'."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:22 of
+msgid ""
+"The image resize backend type, accepted values are `cv2` and `pillow`. Only valid if efficientnet style is "
+"True. Defaults to `cv2`."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:28
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead:17
+#: mmcls.models.heads.multi_label_linear_head.MultiLabelLinearClsHead:17
+#: mmcls.models.losses.label_smooth_loss.LabelSmoothLoss:30 of
+msgid "提示"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:29 of
+msgid "If the image is smaller than the crop size, return the original image."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:31 of
+msgid "The pipeline will be to first to perform the center crop with the ``crop_size_`` as:"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:34 of
+msgid ""
+"\\text{crop_size_} = \\frac{\\text{crop_size}}{\\text{crop_size} +\n"
+"\\text{crop_padding}} \\times \\text{short_edge}"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:39 of
+msgid "And then the pipeline resizes the img to the input crop size."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform:1
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform:1 of
+msgid "Transform function to randomly resized crop images."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform:6 of
+msgid ""
+"EfficientNet style center cropped results, 'img_shape' key in result dict is updated according to crop "
+"size."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform:8 of
+msgid "EfficientNet style center cropped results, 'img_shape'"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform:9
+#: mmcls.datasets.transforms.processing.RandomCrop.transform:9
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform:9 of
+msgid "key in result dict is updated according to crop size."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.EfficientNetRandomCrop.rst:7
+msgid "EfficientNetRandomCrop"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:12 of
+msgid "Desired output scale of the crop. Only int size is accepted, a square crop (size, size) is made."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:15 of
+msgid "Minimum ratio of the cropped area to the original area. Defaults to 0.1."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:21
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:20 of
+msgid "Range of the random size of the cropped image compared to the original image. Defaults to (0.08, 1.0)."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:24
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:23 of
+msgid ""
+"Range of the random aspect ratio of the cropped image compared to the original image. Defaults to (3. / 4., "
+"4. / 3.)."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:28
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:27 of
+msgid "Maximum number of attempts before falling back to Central Crop. Defaults to 10."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:31 of
+msgid ""
+"Interpolation method, accepted values are 'nearest', 'bilinear', 'bicubic', 'area', 'lanczos'. Defaults to "
+"'bicubic'."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:35
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:34 of
+msgid "The image resize backend type, accepted values are 'cv2' and 'pillow'. Defaults to 'cv2'."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Equalize.rst:7
+msgid "Equalize"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Equalize:3 of
+msgid "The probability for performing equalize therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Invert.rst:7
+msgid "Invert"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Invert:3 of
+msgid "The probability for performing invert therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Lighting.rst:7
+msgid "Lighting"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Lighting:11 of
+msgid "the eigenvalue of the convariance matrix of pixel values, respectively."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Lighting:14 of
+msgid "the eigenvector of the convariance matrix of pixel values, respectively."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Lighting:17 of
+msgid "The standard deviation for distribution of alpha. Defaults to 0.1."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Lighting:20 of
+msgid "Whether to convert img to rgb. Defaults to False."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Lighting.transform:6 of
+msgid "Lightinged results, 'img' key is updated in result dict."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.PackClsInputs.rst:7
+msgid "PackClsInputs"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:6 of
+msgid "gt_label (optional)"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:7 of
+msgid "``*meta_keys`` (optional)"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:11 of
+msgid "All keys in the dict."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:13 mmcls.datasets.transforms.processing.ResizeEdge:12 of
+msgid "**Added Keys:**"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:15 of
+msgid "inputs (:obj:`torch.Tensor`): The forward data of models."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:16 of
+msgid "data_samples (:obj:`~mmcls.structures.ClsDataSample`): The annotation info of the sample."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:19 of
+msgid ""
+"The meta keys to be saved in the ``metainfo`` of the packed ``data_samples``. Defaults to a tuple includes "
+"keys: - ``sample_idx``: The id of the image sample. - ``img_path``: The path to the image file. - "
+"``ori_shape``: The original shape of the image as a tuple (H, W). - ``img_shape``: The shape of the image "
+"after the pipeline as a tuple (H, W). - ``scale_factor``: The scale factor between the resized image "
+"and the original image. - ``flip``: A boolean indicating if image flip transform was used. - "
+"``flip_direction``: The flipping direction."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:19 of
+msgid ""
+"The meta keys to be saved in the ``metainfo`` of the packed ``data_samples``. Defaults to a tuple includes "
+"keys:"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:23 of
+msgid "``sample_idx``: The id of the image sample."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:24 of
+msgid "``img_path``: The path to the image file."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:25 of
+msgid "``ori_shape``: The original shape of the image as a tuple (H, W)."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:26 of
+msgid "``img_shape``: The shape of the image after the pipeline as a tuple (H, W)."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:28 of
+msgid "``scale_factor``: The scale factor between the resized image and the original image."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:30 of
+msgid "``flip``: A boolean indicating if image flip transform was used."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:31 of
+msgid "``flip_direction``: The flipping direction."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs.transform:1 of
+msgid "Method to pack the input data."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Posterize.rst:7
+msgid "Posterize"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Posterize:3 of
+msgid ""
+"Number of bits for each pixel in the output img, which should be less or equal to 8. If None, generate from "
+"``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Posterize:8 of
+msgid "The probability for posterizing therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.RandAugment.rst:7
+msgid "RandAugment"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:3 of
+msgid ""
+"This data augmentation is proposed in `RandAugment: Practical automated data augmentation with a reduced "
+"search space `_."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:7 of
+msgid ""
+"The policies of random augmentation. If string, use preset policies collection like \"timm_increasing\". If "
+"list, each item is one specific augmentation policy dict. The policy dict shall should have these keys: - "
+"``type`` (str), The type of augmentation. - ``magnitude_range`` (Sequence[number], optional): For those "
+"augmentation have magnitude, you need to specify the magnitude level mapping range. For example, assume "
+"``total_level`` is 10, ``magnitude_level=3`` specify magnitude is 3 if ``magnitude_range=(0, 10)`` "
+"while specify magnitude is 7 if ``magnitude_range=(10, 0)``. - other keyword arguments of the "
+"augmentation."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:7 of
+msgid ""
+"The policies of random augmentation. If string, use preset policies collection like \"timm_increasing\". If "
+"list, each item is one specific augmentation policy dict. The policy dict shall should have these keys:"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:12 of
+msgid "``type`` (str), The type of augmentation."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:13 of
+msgid ""
+"``magnitude_range`` (Sequence[number], optional): For those augmentation have magnitude, you need to "
+"specify the magnitude level mapping range. For example, assume ``total_level`` is 10, ``magnitude_level=3`` "
+"specify magnitude is 3 if ``magnitude_range=(0, 10)`` while specify magnitude is 7 if "
+"``magnitude_range=(10, 0)``."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:19 of
+msgid "other keyword arguments of the augmentation."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:21 of
+msgid "Number of policies to select from policies each time."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:24 of
+msgid "Magnitude level for all the augmentation selected."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:27 of
+msgid ""
+"Deviation of magnitude noise applied. - If positive number, the magnitude obeys normal distribution :"
+"math:`\\mathcal{N}(magnitude_level, magnitude_std)`. - If 0 or negative number, magnitude remains "
+"unchanged. - If str \"inf\", the magnitude obeys uniform distribution :math:`Uniform(min, magnitude)`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:29 of
+msgid ""
+"If positive number, the magnitude obeys normal distribution :math:`\\mathcal{N}(magnitude_level, "
+"magnitude_std)`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:45 of
+msgid ""
+"To use \"timm-increasing\" policies collection, select two policies every time, and magnitude_level of "
+"every policy is 6 (total is 10 by default)"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:60 of
+msgid ""
+"If you want the ``magnitude_level`` randomly changes every time, you can use ``magnitude_std`` to specify "
+"the random distribution. For example, a normal distribution :math:`\\mathcal{N}(6, 0.5)`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:71 of
+msgid "You can also use your own policies:"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:86 of
+msgid ""
+"``magnitude_std`` will introduce some randomness to policy, modified by https://github.com/rwightman/"
+"pytorch-image-models."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:89 of
+msgid "When magnitude_std=0, we calculate the magnitude as follows:"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:91 of
+msgid ""
+"\\text{magnitude} = \\frac{\\text{magnitude_level}}\n"
+"{\\text{totallevel}} \\times (\\text{val2} - \\text{val1})\n"
+"+ \\text{val1}\n"
+"\n"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment.transform:1 of
+msgid "Randomly choose a sub-policy to apply."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.RandomCrop.rst:7
+msgid "RandomCrop"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:12 of
+msgid ""
+"Desired output size of the crop. If crop_size is an int instead of sequence like (h, w), a square crop "
+"(crop_size, crop_size) is made."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:16 of
+msgid ""
+"Optional padding on each border of the image. If a sequence of length 4 is provided, it is used to pad "
+"left, top, right, bottom borders respectively. If a sequence of length 2 is provided, it is used to pad "
+"left/right, top/bottom borders, respectively. Default: None, which means no padding."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:22 of
+msgid ""
+"It will pad the image if smaller than the desired size to avoid raising an exception. Since cropping is "
+"done after padding, the padding seems to be done at a random offset. Default: False."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:27 of
+msgid ""
+"Pixel pad_val value for constant fill. If a tuple of length 3, it is used to pad_val R, G, B channels "
+"respectively. Default: 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:31 of
+msgid ""
+"Type of padding. Defaults to \"constant\". Should be one of the following: - ``constant``: Pads with a "
+"constant value, this value is specified with pad_val. - ``edge``: pads with the last value at the edge of "
+"the image. - ``reflect``: Pads with reflection of image without repeating the last value on the edge. For "
+"example, padding [1, 2, 3, 4] with 2 elements on both sides in reflect mode will result in [3, 2, 1, 2, "
+"3, 4, 3, 2]. - ``symmetric``: Pads with reflection of image repeating the last value on the edge. For "
+"example, padding [1, 2, 3, 4] with 2 elements on both sides in symmetric mode will result in [2, 1, 1, "
+"2, 3, 4, 4, 3]."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:31 of
+msgid "Type of padding. Defaults to \"constant\". Should be one of the following:"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:34 of
+msgid "``constant``: Pads with a constant value, this value is specified with pad_val."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:36 of
+msgid "``edge``: pads with the last value at the edge of the image."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:37 of
+msgid ""
+"``reflect``: Pads with reflection of image without repeating the last value on the edge. For example, "
+"padding [1, 2, 3, 4] with 2 elements on both sides in reflect mode will result in [3, 2, 1, 2, 3, 4, 3, 2]."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:41 of
+msgid ""
+"``symmetric``: Pads with reflection of image repeating the last value on the edge. For example, padding [1, "
+"2, 3, 4] with 2 elements on both sides in symmetric mode will result in [2, 1, 1, 2, 3, 4, 4, 3]."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop.transform:1 of
+msgid "Transform function to randomly crop images."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop.transform:6 of
+msgid "Randomly cropped results, 'img_shape' key in result dict is updated according to crop size."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop.transform:8 of
+msgid "Randomly cropped results, 'img_shape'"
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.RandomErasing.rst:7
+msgid "RandomErasing"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:11 of
+msgid "Probability that image will be randomly erased. Default: 0.5"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:14 of
+msgid "Minimum erased area / input image area Default: 0.02"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:17 of
+msgid "Maximum erased area / input image area Default: 0.4"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:20 of
+msgid ""
+"Aspect ratio range of erased area. if float, it will be converted to (aspect_ratio, 1/aspect_ratio) "
+"Default: (3/10, 10/3)"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:24 of
+msgid ""
+"Fill method in erased area, can be: - const (default): All pixels are assign with the same value. - rand: "
+"each pixel is assigned with a random value in [0, 255]"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:24 of
+msgid "Fill method in erased area, can be:"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:26 of
+msgid "const (default): All pixels are assign with the same value."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:27 of
+msgid "rand: each pixel is assigned with a random value in [0, 255]"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:29 of
+msgid "Base color filled in erased area. Defaults to (128, 128, 128)."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:32 of
+msgid ""
+"If set and ``mode`` is 'rand', fill erased area with random color from normal distribution "
+"(mean=fill_color, std=fill_std); If not set, fill erased area with random color from uniform distribution "
+"(0~255). Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:40 of
+msgid "See `Random Erasing Data Augmentation `_"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:43 of
+msgid ""
+"This paper provided 4 modes: RE-R, RE-M, RE-0, RE-255, and use RE-M as default. The config of these 4 modes "
+"are:"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:46 of
+msgid "RE-R: RandomErasing(mode='rand')"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:47 of
+msgid "RE-M: RandomErasing(mode='const', fill_color=(123.67, 116.3, 103.5))"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:48 of
+msgid "RE-0: RandomErasing(mode='const', fill_color=0)"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:49 of
+msgid "RE-255: RandomErasing(mode='const', fill_color=255)"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing.transform:1 of
+msgid "Results dict from pipeline"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing.transform:4 of
+msgid "Results after the transformation."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.RandomResizedCrop.rst:7
+msgid "RandomResizedCrop"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:3 of
+msgid ""
+"A crop of random size (default: of 0.08 to 1.0) of the original size and a random aspect ratio (default: of "
+"3/4 to 4/3) of the original aspect ratio is made. This crop is finally resized to given size."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:16 of
+msgid ""
+"Desired output scale of the crop. If size is an int instead of sequence like (h, w), a square crop (size, "
+"size) is made."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:30 of
+msgid ""
+"Interpolation method, accepted values are 'nearest', 'bilinear', 'bicubic', 'area', 'lanczos'. Defaults to "
+"'bilinear'."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform:6 of
+msgid ""
+"Randomly resized cropped results, 'img_shape' key in result dict is updated according to crop size."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform:8 of
+msgid "Randomly resized cropped results, 'img_shape'"
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.ResizeEdge.rst:7
+msgid "ResizeEdge"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge:14 of
+msgid "scale"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge:15 of
+msgid "scale_factor"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge:17 of
+msgid "The edge scale to resizing."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge:19 of
+msgid "The edge to resize. Defaults to 'short'."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge:21 of
+msgid ""
+"Image resize backend, choices are 'cv2' and 'pillow'. These two backends generates slightly different "
+"results. Defaults to 'cv2'."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge:25 of
+msgid ""
+"Interpolation method, accepted values are \"nearest\", \"bilinear\", \"bicubic\", \"area\", \"lanczos\" for "
+"'cv2' backend, \"nearest\", \"bilinear\" for 'pillow' backend. Defaults to 'bilinear'."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge.transform:6 of
+msgid "Resized results, 'img', 'scale', 'scale_factor', 'img_shape' keys are updated in result dict."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Rotate.rst:7
+msgid "Rotate"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:3 of
+msgid ""
+"The angle used for rotate. Positive values stand for clockwise rotation. If None, generate from "
+"``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:8 of
+msgid ""
+"Center point (w, h) of the rotation in the source image. If None, the center of the image will be used. "
+"Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:12 of
+msgid "Isotropic scale factor. Defaults to 1.0."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:14 mmcls.datasets.transforms.auto_augment.Shear:7
+#: mmcls.datasets.transforms.auto_augment.Translate:9 of
+msgid ""
+"Pixel pad_val value for constant fill. If a sequence of length 3, it is used to pad_val R, G, B channels "
+"respectively. Defaults to 128."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:18 of
+msgid "The probability for performing rotate therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:21 of
+msgid "The probability that turns the angle negative, which should be in range [0,1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:24 mmcls.datasets.transforms.auto_augment.Translate:22 of
+msgid ""
+"Interpolation method. Options are 'nearest', 'bilinear', 'bicubic', 'area', 'lanczos'. Defaults to "
+"'nearest'."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Sharpness.rst:7
+msgid "Sharpness"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Sharpness:3 of
+msgid ""
+"The magnitude used for adjusting sharpness. A positive magnitude would enhance the sharpness and a negative "
+"magnitude would make the image bulr. A magnitude=0 gives the origin img. If None, generate from "
+"``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Sharpness:9 of
+msgid ""
+"The probability for performing sharpness adjusting therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Shear.rst:7
+msgid "Shear"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Shear:3 of
+msgid ""
+"The magnitude used for shear. If None, generate from ``magnitude_range``, see :class:`BaseAugTransform`. "
+"Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Shear:11 of
+msgid "The probability for performing shear therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Shear:14 of
+msgid "The shearing direction. Options are 'horizontal' and 'vertical'. Defaults to 'horizontal'."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Shear:20 of
+msgid ""
+"Interpolation method. Options are 'nearest', 'bilinear', 'bicubic', 'area', 'lanczos'. Defaults to "
+"'bicubic'."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Solarize.rst:7
+msgid "Solarize"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Solarize:3 of
+msgid ""
+"The threshold above which the pixels value will be inverted. If None, generate from ``magnitude_range``, "
+"see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Solarize:7 mmcls.datasets.transforms.auto_augment.SolarizeAdd:10 of
+msgid "The probability for solarizing therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.SolarizeAdd.rst:7
+msgid "SolarizeAdd"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.SolarizeAdd:3 of
+msgid ""
+"The value to be added to pixels below the thr. If None, generate from ``magnitude_range``, see :class:"
+"`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.SolarizeAdd:7 of
+msgid "The threshold below which the pixels value will be adjusted."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.ToNumpy.rst:7
+msgid "ToNumpy"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.ToNumpy:5 mmcls.datasets.transforms.formatting.ToNumpy:9 of
+msgid "``*keys**``"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.ToNumpy:11 of
+msgid "The dtype of the converted numpy array. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.ToNumpy.transform:1 of
+msgid "Method to convert object to :obj:`numpy.ndarray`."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.ToPIL.rst:7
+msgid "ToPIL"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.ToPIL.transform:1 of
+msgid "Method to convert images to :obj:`PIL.Image.Image`."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Translate.rst:7
+msgid "Translate"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Translate:3 of
+msgid ""
+"The magnitude used for translate. Note that the offset is calculated by magnitude * size in the "
+"corresponding direction. With a magnitude of 1, the whole image will be moved out of the range. If None, "
+"generate from ``magnitude_range``, see :class:`BaseAugTransform`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Translate:13 of
+msgid "The probability for performing translate therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Translate:16 of
+msgid "The translating direction. Options are 'horizontal' and 'vertical'. Defaults to 'horizontal'."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Transpose.rst:7
+msgid "Transpose"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Transpose:11 of
+msgid "The fields to convert to tensor."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Transpose:13 of
+msgid "The output dimensions order."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Transpose.transform:1 of
+msgid "Method to transpose array."
+msgstr ""
+
+#: ../../api/generated/mmcls.engine.hooks.ClassNumCheckHook.rst:7
+msgid "ClassNumCheckHook"
+msgstr ""
+
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_test:1 of
+msgid "Check whether the test dataset is compatible with head."
+msgstr ""
+
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_test:3
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_train:3
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_val:3 of
+msgid "`IterBasedRunner`): Iter based Runner."
+msgstr ""
+
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_train:1 of
+msgid "Check whether the training dataset is compatible with head."
+msgstr ""
+
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_val:1 of
+msgid "Check whether the validation dataset is compatible with head."
+msgstr ""
+
+#: ../../api/generated/mmcls.engine.hooks.PreciseBNHook.rst:7
+msgid "PreciseBNHook"
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:3 of
+msgid ""
+"Recompute and update the batch norm stats to make them more precise. During training both BN stats and the "
+"weight are changing after every iteration, so the running average can not precisely reflect the actual "
+"stats of the current model."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:8 of
+msgid ""
+"With this hook, the BN stats are recomputed with fixed weights, to make the running average more precise. "
+"Specifically, it computes the true average of per-batch mean/variance instead of the running average. See "
+"Sec. 3 of the paper `Rethinking Batch in BatchNorm ` for details."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:14 of
+msgid ""
+"This hook will update BN stats, so it should be executed before ``CheckpointHook`` and ``EMAHook``, "
+"generally set its priority to \"ABOVE_NORMAL\"."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:18 of
+msgid "The number of samples to update the bn stats. Defaults to 8192."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:21 of
+msgid "Perform precise bn interval. If the train loop is"
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:23 mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:25 of
+msgid "train loop is `IterBasedTrainLoop` or `by_epoch=False`, its unit is 'iter'. Defaults to 1."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_epoch:1
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_iter:1 of
+msgid "Calculate prcise BN and broadcast BN stats across GPUs."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_epoch:3
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_iter:3 of
+msgid "`Runner`): The runner of the training process."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_iter:4 of
+msgid "The index of the current batch in the train loop."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_iter:6 of
+msgid "Data from dataloader. Defaults to None."
+msgstr ""
+
+#: ../../api/generated/mmcls.engine.hooks.PrepareProtoBeforeValLoopHook.rst:7
+msgid "PrepareProtoBeforeValLoopHook"
+msgstr ""
+
+#: mmcls.engine.hooks.retriever_hooks.PrepareProtoBeforeValLoopHook:3 of
+msgid ""
+"Since the encoders of the retriever changes during training, the prototype changes accordingly. So the "
+"`prototype_vecs` needs to be regenerated before validation loop."
+msgstr ""
+
+#: ../../api/generated/mmcls.engine.hooks.SetAdaptiveMarginsHook.rst:7
+msgid "SetAdaptiveMarginsHook"
+msgstr ""
+
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook:4 of
+msgid ""
+"A PyTorch implementation of paper `Google Landmark Recognition 2020 Competition Third Place Solution "
+"`_. The margins will be :math:`\\text{f}(n) = (marginMax - marginMin) · "
+"norm(n^p) + marginMin`. The `n` indicates the number of occurrences of a category."
+msgstr ""
+
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook:10 of
+msgid "Lower bound of margins. Defaults to 0.05."
+msgstr ""
+
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook:12 of
+msgid "Upper bound of margins. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook:14 of
+msgid "The power of category freqercy. Defaults to -0.25."
+msgstr ""
+
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook.before_train:1 of
+msgid "change the margins in ArcFaceClsHead."
+msgstr ""
+
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook.before_train:3 of
+msgid "`Runner`): Runner."
+msgstr ""
+
+#: ../../api/generated/mmcls.engine.hooks.VisualizationHook.rst:7
+msgid "VisualizationHook"
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:1 of
+msgid "Classification Visualization Hook. Used to visualize validation and testing prediction results."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:4 of
+msgid "If ``out_dir`` is specified, all storage backends are ignored and save the image to the ``out_dir``."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:6 of
+msgid ""
+"If ``show`` is True, plot the result image in a window, please confirm you are able to access the graphical "
+"interface."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:9 of
+msgid "Whether to enable this hook. Defaults to False."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:11 of
+msgid "The interval of samples to visualize. Defaults to 5000."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:13 of
+msgid "Whether to display the drawn image. Defaults to False."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:15 of
+msgid ""
+"directory where painted images will be saved in the testing process. If None, handle with the backends of "
+"the visualizer. Defaults to None."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:19 of
+msgid "other keyword arguments of :meth:`mmcls.visualization.ClsVisualizer.add_datasample`."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_test_iter:1 of
+msgid "Visualize every ``self.interval`` samples during test."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_test_iter:3 of
+msgid "The runner of the testing process."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_test_iter:5 of
+msgid "The index of the current batch in the test loop."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_test_iter:7
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_val_iter:7 of
+msgid "Data from dataloader."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_test_iter:9
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_val_iter:9 of
+msgid "Outputs from model."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_val_iter:1 of
+msgid "Visualize every ``self.interval`` samples during validation."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_val_iter:3 of
+msgid "The runner of the validation process."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_val_iter:5 of
+msgid "The index of the current batch in the val loop."
+msgstr ""
+
+#: ../../api/generated/mmcls.engine.optimizers.Lamb.rst:7
+msgid "Lamb"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:3 of
+msgid ""
+"This class is copied from `timm`_. The LAMB was proposed in `Large Batch Optimization for Deep Learning - "
+"Training BERT in 76 minutes`_."
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:11 of
+msgid "iterable of parameters to optimize or dicts defining"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:14 of
+msgid "learning rate. (default: 1e-3)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:16 of
+msgid "coefficients used for computing running averages of gradient and its norm. (default: (0.9, 0.999))"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:19 of
+msgid "term added to the denominator to improve numerical stability. (default: 1e-8)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:22 of
+msgid "weight decay (L2 penalty) (default: 0)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:24 of
+msgid "whether apply (1-beta2) to grad when calculating running averages of gradient. (default: True)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:27 of
+msgid "value used to clip global grad norm (default: 1.0)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:30 of
+msgid "enable LAMBC trust ratio clipping (default: False)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:32 of
+msgid "Apply adaptive learning rate to 0.0 weight decay parameter (default: False)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb.step:1 of
+msgid "Performs a single optimization step."
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb.step:3 of
+msgid "A closure that reevaluates the model and returns the loss."
+msgstr ""
+
+#: ../../api/generated/mmcls.evaluation.Accuracy.rst:7
+msgid "Accuracy"
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy:3 of
+msgid ""
+"For either binary classification or multi-class classification, the accuracy is the fraction of correct "
+"predictions in all predictions:"
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy:6 of
+msgid "\\text{Accuracy} = \\frac{N_{\\text{correct}}}{N_{\\text{all}}}"
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy:10 of
+msgid ""
+"If the ground truth label matches one of the best **k** predictions, the sample will be regard as a "
+"positive prediction. If the parameter is a tuple, all of top-k accuracy will be calculated and outputted "
+"together. Defaults to 1."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy:15 of
+msgid ""
+"If a float, predictions with score lower than the threshold will be regard as the negative prediction. If "
+"None, not apply threshold. If the parameter is a tuple, accuracy based on all thresholds will be calculated "
+"and outputted together. Defaults to 0."
+msgstr ""
+
+#: mmcls.evaluation.metrics.multi_label.AveragePrecision:22
+#: mmcls.evaluation.metrics.multi_label.MultiLabelMetric:58 mmcls.evaluation.metrics.single_label.Accuracy:21
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric:57 of
+msgid ""
+"Device name used for collecting results from different ranks during distributed training. Must be 'cpu' or "
+"'gpu'. Defaults to 'cpu'."
+msgstr ""
+
+#: mmcls.evaluation.metrics.multi_label.AveragePrecision:26
+#: mmcls.evaluation.metrics.multi_label.MultiLabelMetric:62 mmcls.evaluation.metrics.single_label.Accuracy:25
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric:61 of
+msgid ""
+"The prefix that will be added in the metric names to disambiguate homonymous metrics of different "
+"evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults "
+"to None."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:1 of
+msgid "Calculate the accuracy."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:3
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.calculate:3 of
+msgid "The prediction results. It can be labels (N, ), or scores of every class (N, C)."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:7
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.calculate:7 of
+msgid "The target of each prediction with shape (N, )."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:10
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.calculate:10 of
+msgid ""
+"Predictions with scores under the thresholds are considered negative. It's only used when ``pred`` is "
+"scores. None means no thresholds. Defaults to (0., )."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:15 of
+msgid ""
+"Predictions with scores under the thresholds are considered negative. It's only used when ``pred`` is "
+"scores. Defaults to (0., )."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:20 of
+msgid ""
+"Accuracy. - torch.Tensor: If the ``pred`` is a sequence of label instead of score (number of dimensions "
+"is 1). Only return a top-1 accuracy tensor, and ignore the argument ``topk` and ``thrs``. - "
+"List[List[torch.Tensor]]: If the ``pred`` is a sequence of score (number of dimensions is 2). Return the "
+"accuracy on each ``topk`` and ``thrs``. And the first dim is ``topk``, the second dim is ``thrs``."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:20 of
+msgid "Accuracy."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:22 of
+msgid ""
+"torch.Tensor: If the ``pred`` is a sequence of label instead of score (number of dimensions is 1). Only "
+"return a top-1 accuracy tensor, and ignore the argument ``topk` and ``thrs``."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:25 of
+msgid ""
+"List[List[torch.Tensor]]: If the ``pred`` is a sequence of score (number of dimensions is 2). Return the "
+"accuracy on each ``topk`` and ``thrs``. And the first dim is ``topk``, the second dim is ``thrs``."
+msgstr ""
+
+#: ../../api/generated/mmcls.evaluation.AveragePrecision.rst:25: