# MobileOne
> [An Improved One millisecond Mobile Backbone](https://arxiv.org/abs/2206.04040)
<!-- [ALGORITHM] -->
## Introduction
MobileOne is proposed by Apple and is built on structural reparameterization. On Apple chips, it reaches about 76% top-1 accuracy on ImageNet with less than 1 ms latency. Its main improvements over [RepVGG](../repvgg) are the following (a conv+BN folding sketch is given after the figure below):
- Reparameterization using depthwise and pointwise convolutions instead of normal convolutions.
- Removal of the residual structure, which is not friendly to memory access.
<div align=center>
<img src="https://user-images.githubusercontent.com/18586273/183552452-74657532-f461-48f7-9aa7-c23f006cdb07.png" width="40%"/>
</div>
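At inference time, every training-time branch of a MobileOne block is a convolution followed by BatchNorm, and such a pair can always be folded into a single convolution with rescaled weights and a shifted bias; summing the folded branches then yields one plain conv. Below is a minimal sketch of this standard conv+BN folding in plain PyTorch, independent of mmpretrain (`fuse_conv_bn` is an illustrative helper, not an mmpretrain API):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a trailing BatchNorm into the preceding convolution."""
    fused = nn.Conv2d(
        conv.in_channels, conv.out_channels, conv.kernel_size,
        stride=conv.stride, padding=conv.padding,
        groups=conv.groups, bias=True)
    # BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta, applied per channel
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    with torch.no_grad():
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        fused.bias.copy_(bn.bias + scale * (b - bn.running_mean))
    return fused

# sanity check with a depthwise conv, as used by MobileOne branches
conv = nn.Conv2d(8, 8, 3, padding=1, groups=8, bias=False)
bn = nn.BatchNorm2d(8)
bn.running_mean.uniform_(-1, 1)
bn.running_var.uniform_(0.5, 2.0)
bn.eval()  # folding assumes inference-mode (running) statistics
x = torch.rand(2, 8, 16, 16)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```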
## Abstract
<details>
<summary>Show the paper's abstract</summary>
<br>
Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with latency of the network when deployed on a mobile device. Therefore, we perform extensive analysis of different metrics by deploying several mobile-friendly networks on a mobile device. We identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks. To this end, we design an efficient backbone MobileOne, with variants achieving an inference time under 1 ms on an iPhone12 with 75.9% top-1 accuracy on ImageNet. We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile. Our best model obtains similar performance on ImageNet as MobileFormer while being 38x faster. Our model obtains 2.3% better top-1 accuracy on ImageNet than EfficientNet at similar latency. Furthermore, we show that our model generalizes to multiple tasks - image classification, object detection, and semantic segmentation with significant improvements in latency and accuracy as compared to existing efficient architectures when deployed on a mobile device.
</br>
</details>
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('mobileone-s0_8xb32_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('mobileone-s0_8xb32_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/mobileone/mobileone-s0_8xb32_in1k.py
```
Test:
```shell
python tools/test.py configs/mobileone/mobileone-s0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: |
| `mobileone-s0_8xb32_in1k` | From scratch | 2.08 | 0.27 | 71.34 | 89.87 | [config](mobileone-s0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.json) |
| `mobileone-s1_8xb32_in1k` | From scratch | 4.76 | 0.82 | 75.72 | 92.54 | [config](mobileone-s1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s1_8xb32_in1k_20221110-ceeef467.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s1_8xb32_in1k_20221110-ceeef467.json) |
| `mobileone-s2_8xb32_in1k` | From scratch | 7.81 | 1.30 | 77.37 | 93.34 | [config](mobileone-s2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s2_8xb32_in1k_20221110-9c7ecb97.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s2_8xb32_in1k_20221110-9c7ecb97.json) |
| `mobileone-s3_8xb32_in1k` | From scratch | 10.08 | 1.89 | 78.06 | 93.83 | [config](mobileone-s3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s3_8xb32_in1k_20221110-c95eb3bf.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s3_8xb32_in1k_20221110-c95eb3bf.json) |
| `mobileone-s4_8xb32_in1k` | From scratch | 14.84 | 2.98 | 79.69 | 94.46 | [config](mobileone-s4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s4_8xb32_in1k_20221110-28d888cb.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s4_8xb32_in1k_20221110-28d888cb.json) |
## Citation
```bibtex
@article{mobileone2022,
title={An Improved One millisecond Mobile Backbone},
author={Vasu, Pavan Kumar Anasosalu and Gabriel, James and Zhu, Jeff and Tuzel, Oncel and Ranjan, Anurag},
journal={arXiv preprint arXiv:2206.04040},
year={2022}
}
```
_base_ = ['../mobileone-s0_8xb32_in1k.py']
model = dict(backbone=dict(deploy=True))
_base_ = ['../mobileone-s1_8xb32_in1k.py']
model = dict(backbone=dict(deploy=True))
_base_ = ['../mobileone-s2_8xb32_in1k.py']
model = dict(backbone=dict(deploy=True))
_base_ = ['../mobileone-s3_8xb32_in1k.py']
model = dict(backbone=dict(deploy=True))
_base_ = ['../mobileone-s4_8xb32_in1k.py']
model = dict(backbone=dict(deploy=True))
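The `deploy=True` configs above instantiate the backbone directly in its fused, inference-only form. To convert an already trained multi-branch model, mmpretrain's reparameterized backbones (RepVGG, MobileOne) expose a `switch_to_deploy()`-style fusion; the snippet below is a hedged sketch of verifying that fusion in Python, and the exact method name should be checked against your mmpretrain version:

```python
import torch
from mmpretrain import get_model

# load the training-time, multi-branch model
model = get_model('mobileone-s0_8xb32_in1k', pretrained=True).eval()

x = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    before = model(x)
    # assumption: switch_to_deploy() fuses all branches in place
    model.backbone.switch_to_deploy()
    after = model(x)

# outputs should agree up to numerical precision
print(torch.allclose(before, after, atol=1e-5))
```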
Collections:
- Name: MobileOne
Metadata:
Training Data: ImageNet-1k
Architecture:
- re-parameterization Convolution
- VGG-style Neural Network
- Depthwise Convolution
- Pointwise Convolution
Paper:
URL: https://arxiv.org/abs/2206.04040
Title: 'An Improved One millisecond Mobile Backbone'
README: configs/mobileone/README.md
Code:
URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc1/configs/mobileone/metafile.yml
Version: v1.0.0rc1
Models:
- Name: mobileone-s0_8xb32_in1k
In Collection: MobileOne
Config: configs/mobileone/mobileone-s0_8xb32_in1k.py
Metadata:
FLOPs: 274136576 # 0.27G
Parameters: 2078504 # 2.08M
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 71.34
Top 5 Accuracy: 89.87
Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.pth
- Name: mobileone-s1_8xb32_in1k
In Collection: MobileOne
Config: configs/mobileone/mobileone-s1_8xb32_in1k.py
Metadata:
FLOPs: 823839744 # 0.82G
Parameters: 4764840 # 4.76M
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 75.72
Top 5 Accuracy: 92.54
Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s1_8xb32_in1k_20221110-ceeef467.pth
- Name: mobileone-s2_8xb32_in1k
In Collection: MobileOne
Config: configs/mobileone/mobileone-s2_8xb32_in1k.py
Metadata:
FLOPs: 1296478848
Parameters: 7808168
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 77.37
Top 5 Accuracy: 93.34
Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s2_8xb32_in1k_20221110-9c7ecb97.pth
- Name: mobileone-s3_8xb32_in1k
In Collection: MobileOne
Config: configs/mobileone/mobileone-s3_8xb32_in1k.py
Metadata:
FLOPs: 1893842944
Parameters: 10078312
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 78.06
Top 5 Accuracy: 93.83
Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s3_8xb32_in1k_20221110-c95eb3bf.pth
- Name: mobileone-s4_8xb32_in1k
In Collection: MobileOne
Config: configs/mobileone/mobileone-s4_8xb32_in1k.py
Metadata:
FLOPs: 2979222528
Parameters: 14838352
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 79.69
Top 5 Accuracy: 94.46
Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s4_8xb32_in1k_20221110-28d888cb.pth
_base_ = [
'../_base_/models/mobileone/mobileone_s0.py',
'../_base_/datasets/imagenet_bs32_pil_resize.py',
'../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
'../_base_/default_runtime.py'
]
# schedule settings
optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
val_dataloader = dict(batch_size=256)
test_dataloader = dict(batch_size=256)
custom_hooks = [
dict(
type='EMAHook',
momentum=5e-4,
priority='ABOVE_NORMAL',
update_buffers=True)
]
_base_ = [
'../_base_/models/mobileone/mobileone_s1.py',
'../_base_/datasets/imagenet_bs32_pil_resize.py',
'../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
'../_base_/default_runtime.py'
]
# schedule settings
optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
val_dataloader = dict(batch_size=256)
test_dataloader = dict(batch_size=256)
bgr_mean = _base_.data_preprocessor['mean'][::-1]
base_train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', scale=224, backend='pillow'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandAugment',
policies='timm_increasing',
num_policies=2,
total_level=10,
magnitude_level=7,
magnitude_std=0.5,
hparams=dict(pad_val=[round(x) for x in bgr_mean])),
dict(type='PackInputs')
]
import copy # noqa: E402
# modify start epoch's RandomResizedCrop.scale to 160
# and RA.magnitude_level * 0.1
train_pipeline_1e = copy.deepcopy(base_train_pipeline)
train_pipeline_1e[1]['scale'] = 160
train_pipeline_1e[3]['magnitude_level'] *= 0.1
_base_.train_dataloader.dataset.pipeline = train_pipeline_1e
# modify epoch 37's RandomResizedCrop.scale to 192
# and RA.magnitude_level * 0.2
train_pipeline_37e = copy.deepcopy(base_train_pipeline)
train_pipeline_37e[1]['scale'] = 192
train_pipeline_37e[3]['magnitude_level'] *= 0.2
# modify epoch 112's RandomResizedCrop.scale to 224
# and RA.magnitude_level * 0.3
train_pipeline_112e = copy.deepcopy(base_train_pipeline)
train_pipeline_112e[1]['scale'] = 224
train_pipeline_112e[3]['magnitude_level'] *= 0.3
custom_hooks = [
dict(
type='SwitchRecipeHook',
schedule=[
dict(action_epoch=37, pipeline=train_pipeline_37e),
dict(action_epoch=112, pipeline=train_pipeline_112e),
]),
dict(
type='EMAHook',
momentum=5e-4,
priority='ABOVE_NORMAL',
update_buffers=True)
]
_base_ = [
'../_base_/models/mobileone/mobileone_s2.py',
'../_base_/datasets/imagenet_bs32_pil_resize.py',
'../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
'../_base_/default_runtime.py'
]
# schedule settings
optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
val_dataloader = dict(batch_size=256)
test_dataloader = dict(batch_size=256)
import copy # noqa: E402
bgr_mean = _base_.data_preprocessor['mean'][::-1]
base_train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', scale=224, backend='pillow'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandAugment',
policies='timm_increasing',
num_policies=2,
total_level=10,
magnitude_level=7,
magnitude_std=0.5,
hparams=dict(pad_val=[round(x) for x in bgr_mean])),
dict(type='PackInputs')
]
# modify start epoch RandomResizedCrop.scale to 160
# and RA.magnitude_level * 0.3
train_pipeline_1e = copy.deepcopy(base_train_pipeline)
train_pipeline_1e[1]['scale'] = 160
train_pipeline_1e[3]['magnitude_level'] *= 0.3
_base_.train_dataloader.dataset.pipeline = train_pipeline_1e
# modify epoch 37's RandomResizedCrop.scale to 192
# and RA.magnitude_level * 0.7
train_pipeline_37e = copy.deepcopy(base_train_pipeline)
train_pipeline_37e[1]['scale'] = 192
train_pipeline_37e[3]['magnitude_level'] *= 0.7
# modify epoch 112's RandomResizedCrop.scale to 224
# and RA.magnitude_level * 1.0
train_pipeline_112e = copy.deepcopy(base_train_pipeline)
train_pipeline_112e[1]['scale'] = 224
train_pipeline_112e[3]['magnitude_level'] *= 1.0
custom_hooks = [
dict(
type='SwitchRecipeHook',
schedule=[
dict(action_epoch=37, pipeline=train_pipeline_37e),
dict(action_epoch=112, pipeline=train_pipeline_112e),
]),
dict(
type='EMAHook',
momentum=5e-4,
priority='ABOVE_NORMAL',
update_buffers=True)
]
_base_ = [
'../_base_/models/mobileone/mobileone_s3.py',
'../_base_/datasets/imagenet_bs32_pil_resize.py',
'../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
'../_base_/default_runtime.py'
]
# schedule settings
optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
val_dataloader = dict(batch_size=256)
test_dataloader = dict(batch_size=256)
import copy # noqa: E402
bgr_mean = _base_.data_preprocessor['mean'][::-1]
base_train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', scale=224, backend='pillow'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandAugment',
policies='timm_increasing',
num_policies=2,
total_level=10,
magnitude_level=7,
magnitude_std=0.5,
hparams=dict(pad_val=[round(x) for x in bgr_mean])),
dict(type='PackInputs')
]
# modify start epoch RandomResizedCrop.scale to 160
# and RA.magnitude_level * 0.3
train_pipeline_1e = copy.deepcopy(base_train_pipeline)
train_pipeline_1e[1]['scale'] = 160
train_pipeline_1e[3]['magnitude_level'] *= 0.3
_base_.train_dataloader.dataset.pipeline = train_pipeline_1e
# modify epoch 37's RandomResizedCrop.scale to 192
# and RA.magnitude_level * 0.7
train_pipeline_37e = copy.deepcopy(base_train_pipeline)
train_pipeline_37e[1]['scale'] = 192
train_pipeline_37e[3]['magnitude_level'] *= 0.7
# modify epoch 112's RandomResizedCrop.scale to 224
# and RA.magnitude_level * 1.0
train_pipeline_112e = copy.deepcopy(base_train_pipeline)
train_pipeline_112e[1]['scale'] = 224
train_pipeline_112e[3]['magnitude_level'] *= 1.0
custom_hooks = [
dict(
type='SwitchRecipeHook',
schedule=[
dict(action_epoch=37, pipeline=train_pipeline_37e),
dict(action_epoch=112, pipeline=train_pipeline_112e),
]),
dict(
type='EMAHook',
momentum=5e-4,
priority='ABOVE_NORMAL',
update_buffers=True)
]
_base_ = [
'../_base_/models/mobileone/mobileone_s4.py',
'../_base_/datasets/imagenet_bs32_pil_resize.py',
'../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
'../_base_/default_runtime.py'
]
# schedule settings
optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
val_dataloader = dict(batch_size=256)
test_dataloader = dict(batch_size=256)
bgr_mean = _base_.data_preprocessor['mean'][::-1]
base_train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', scale=224, backend='pillow'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandAugment',
policies='timm_increasing',
num_policies=2,
total_level=10,
magnitude_level=7,
magnitude_std=0.5,
hparams=dict(pad_val=[round(x) for x in bgr_mean])),
dict(type='PackInputs')
]
import copy # noqa: E402
# modify start epoch RandomResizedCrop.scale to 160
# and RA.magnitude_level * 0.3
train_pipeline_1e = copy.deepcopy(base_train_pipeline)
train_pipeline_1e[1]['scale'] = 160
train_pipeline_1e[3]['magnitude_level'] *= 0.3
_base_.train_dataloader.dataset.pipeline = train_pipeline_1e
# modify epoch 37's RandomResizedCrop.scale to 192
# and RA.magnitude_level * 0.7
train_pipeline_37e = copy.deepcopy(base_train_pipeline)
train_pipeline_37e[1]['scale'] = 192
train_pipeline_37e[3]['magnitude_level'] *= 0.7
# modify epoch 112's RandomResizedCrop.scale to 224
# and RA.magnitude_level * 1.0
train_pipeline_112e = copy.deepcopy(base_train_pipeline)
train_pipeline_112e[1]['scale'] = 224
train_pipeline_112e[3]['magnitude_level'] *= 1.0
custom_hooks = [
dict(
type='SwitchRecipeHook',
schedule=[
dict(action_epoch=37, pipeline=train_pipeline_37e),
dict(action_epoch=112, pipeline=train_pipeline_112e),
]),
dict(
type='EMAHook',
momentum=5e-4,
priority='ABOVE_NORMAL',
update_buffers=True)
]
# MobileViT
> [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178)
<!-- [ALGORITHM] -->
## Introduction
**MobileViT** is a light-weight network that takes advantage of both ViTs and CNNs: it combines the `InvertedResidual` blocks of [MobileNetV2](../mobilenet_v2/README.md) with a `MobileViTBlock`, which builds on [ViT](../vision_transformer/README.md) transformer blocks, to form a standard 5-stage model structure.
The MobileViTBlock treats transformers as convolutions to compute a global representation and combines it with ordinary convolution layers for local representation, yielding a block with a global receptive field. This differs from ViT, which adds an extra class token and position embeddings to learn relative relationships. Without any position embeddings, MobileViT can benefit from multi-scale inputs during training (see the unfold-attend-fold sketch after the figure below).
The paper also puts forward a multi-scale training strategy that dynamically adjusts the batch size based on the image size, improving both training efficiency and final performance.
The model is also proven effective in downstream tasks such as object detection and semantic segmentation.
<div align=center>
<img src="https://user-images.githubusercontent.com/42952108/193229983-822bf025-89a6-4d95-b6be-76b7f1a62f2c.png" width="70%"/>
</div>
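Concretely, the "transformers as convolutions" mechanism is an unfold-attend-fold pattern: the feature map is split into non-overlapping patches, pixels at the same position inside each patch attend to each other across all patches, and the result is folded back to the original layout. A simplified, self-contained sketch of this pattern (not the mmpretrain implementation; the encoder layer and patch size are illustrative):

```python
import torch
import torch.nn as nn

def unfold_attend_fold(x: torch.Tensor, attend: nn.Module, p: int = 2) -> torch.Tensor:
    """Global mixing across patches, as in a MobileViT block (simplified)."""
    B, d, H, W = x.shape
    assert H % p == 0 and W % p == 0, 'H and W must be divisible by the patch size'
    nh, nw = H // p, W // p
    # unfold: one sequence per intra-patch position, whose tokens are the patches
    x = x.reshape(B, d, nh, p, nw, p)
    x = x.permute(0, 3, 5, 2, 4, 1).reshape(B * p * p, nh * nw, d)
    x = attend(x)  # self-attention over all patches -> global receptive field
    # fold: restore the original (B, d, H, W) layout
    x = x.reshape(B, p, p, nh, nw, d)
    x = x.permute(0, 5, 3, 1, 4, 2).reshape(B, d, H, W)
    return x

attend = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
out = unfold_attend_fold(torch.rand(2, 64, 32, 32), attend)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Because the attention runs over patch sequences rather than a fixed token grid, no position embeddings are required, which is what allows the multi-scale training described above.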
## Abstract
<details>
<summary>Show the paper's abstract</summary>
<br>
Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.
</br>
</details>
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('mobilevit-small_3rdparty_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('mobilevit-small_3rdparty_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Test:
```shell
python tools/test.py configs/mobilevit/mobilevit-small_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-small_3rdparty_in1k_20221018-cb4f741c.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :------------------------------------------------------------------------: |
| `mobilevit-small_3rdparty_in1k`\* | From scratch | 5.58 | 2.03 | 78.25 | 94.09 | [config](mobilevit-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-small_3rdparty_in1k_20221018-cb4f741c.pth) |
| `mobilevit-xsmall_3rdparty_in1k`\* | From scratch | 2.32 | 1.05 | 74.75 | 92.32 | [config](mobilevit-xsmall_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xsmall_3rdparty_in1k_20221018-be39a6e7.pth) |
| `mobilevit-xxsmall_3rdparty_in1k`\* | From scratch | 1.27 | 0.42 | 69.02 | 88.91 | [config](mobilevit-xxsmall_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xxsmall_3rdparty_in1k_20221018-77835605.pth) |
*Models with * are converted from the [official repo](https://github.com/apple/ml-cvnets). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@article{mehta2021mobilevit,
title={MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer},
author={Mehta, Sachin and Rastegari, Mohammad},
journal={arXiv preprint arXiv:2110.02178},
year={2021}
}
```
Collections:
- Name: MobileViT
Metadata:
Training Data: ImageNet-1k
Architecture:
- MobileViT Block
Paper:
URL: https://arxiv.org/abs/2110.02178
Title: 'MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer'
README: configs/mobilevit/README.md
Models:
- Name: mobilevit-small_3rdparty_in1k
Metadata:
FLOPs: 2030000000
Parameters: 5580000
In Collection: MobileViT
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 78.25
Top 5 Accuracy: 94.09
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-small_3rdparty_in1k_20221018-cb4f741c.pth
Config: configs/mobilevit/mobilevit-small_8xb128_in1k.py
Converted From:
Weights: https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit_s.pt
Code: https://github.com/apple/ml-cvnets
- Name: mobilevit-xsmall_3rdparty_in1k
Metadata:
FLOPs: 1050000000
Parameters: 2320000
In Collection: MobileViT
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 74.75
Top 5 Accuracy: 92.32
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xsmall_3rdparty_in1k_20221018-be39a6e7.pth
Config: configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py
Converted From:
Weights: https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit_xs.pt
Code: https://github.com/apple/ml-cvnets
- Name: mobilevit-xxsmall_3rdparty_in1k
Metadata:
FLOPs: 420000000
Parameters: 1270000
In Collection: MobileViT
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 69.02
Top 5 Accuracy: 88.91
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xxsmall_3rdparty_in1k_20221018-77835605.pth
Config: configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py
Converted From:
Weights: https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit_xxs.pt
Code: https://github.com/apple/ml-cvnets
_base_ = [
'../_base_/models/mobilevit/mobilevit_s.py',
'../_base_/datasets/imagenet_bs32.py',
'../_base_/default_runtime.py',
'../_base_/schedules/imagenet_bs256.py',
]
# the original implementation does not normalize the input
data_preprocessor = dict(
    # identity normalization: only rescale pixel values to [0, 1]
    mean=[0, 0, 0],
    std=[255, 255, 255],
    # keep the BGR channel order of the loaded images
    to_rgb=False,
)
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='ResizeEdge', scale=288, edge='short'),
dict(type='CenterCrop', crop_size=256),
dict(type='PackInputs'),
]
train_dataloader = dict(batch_size=128)
val_dataloader = dict(
batch_size=128,
dataset=dict(pipeline=test_pipeline),
)
test_dataloader = val_dataloader
_base_ = [
'../_base_/models/mobilevit/mobilevit_xs.py',
'../_base_/datasets/imagenet_bs32.py',
'../_base_/default_runtime.py',
'../_base_/schedules/imagenet_bs256.py',
]
# the original implementation does not normalize the input
data_preprocessor = dict(
    # identity normalization: only rescale pixel values to [0, 1]
    mean=[0, 0, 0],
    std=[255, 255, 255],
    # keep the BGR channel order of the loaded images
    to_rgb=False,
)
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='ResizeEdge', scale=288, edge='short'),
dict(type='CenterCrop', crop_size=256),
dict(type='PackInputs'),
]
train_dataloader = dict(batch_size=128)
val_dataloader = dict(
batch_size=128,
dataset=dict(pipeline=test_pipeline),
)
test_dataloader = val_dataloader
_base_ = [
'../_base_/models/mobilevit/mobilevit_xxs.py',
'../_base_/datasets/imagenet_bs32.py',
'../_base_/default_runtime.py',
'../_base_/schedules/imagenet_bs256.py',
]
# the original implementation does not normalize the input
data_preprocessor = dict(
    # identity normalization: only rescale pixel values to [0, 1]
    mean=[0, 0, 0],
    std=[255, 255, 255],
    # keep the BGR channel order of the loaded images
    to_rgb=False,
)
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='ResizeEdge', scale=288, edge='short'),
dict(type='CenterCrop', crop_size=256),
dict(type='PackInputs'),
]
train_dataloader = dict(batch_size=128)
val_dataloader = dict(
batch_size=128,
dataset=dict(pipeline=test_pipeline),
)
test_dataloader = val_dataloader
# MoCoV2
> [Improved Baselines with Momentum Contrastive Learning](https://arxiv.org/abs/2003.04297)
<!-- [ALGORITHM] -->
## Abstract
Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR. In this note, we verify the effectiveness of two of SimCLR’s design improvements by implementing them in the MoCo framework. With simple modifications to MoCo—namely, using an MLP projection head and more data augmentation—we establish stronger baselines that outperform SimCLR and do not require large training batches. We hope this will make state-of-the-art unsupervised learning research more accessible.
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/149720067-b65e5736-d425-45b3-93ed-6f2427fc6217.png" width="500" />
</div>
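These two modifications sit on top of MoCo's two-encoder design: a query encoder trained by backprop and a key encoder updated only as an exponential moving average of it, with the fc projection replaced by a 2-layer MLP head. A minimal sketch of those two pieces with toy encoders (illustrative only; the real model uses a ResNet-50 backbone, and the momentum follows MoCo's default of 0.999):

```python
import torch
import torch.nn as nn

def mlp_head(dim_in: int = 2048, dim_out: int = 128) -> nn.Module:
    # MoCo v2 replaces the single fc projection with a 2-layer MLP
    return nn.Sequential(nn.Linear(dim_in, dim_in), nn.ReLU(), nn.Linear(dim_in, dim_out))

# toy stand-ins for the ResNet-50 backbone + projection head
encoder_q = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2048), mlp_head())
encoder_k = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2048), mlp_head())
encoder_k.load_state_dict(encoder_q.state_dict())  # start from identical weights
for p in encoder_k.parameters():
    p.requires_grad = False  # the key encoder is never updated by gradients

@torch.no_grad()
def momentum_update(q: nn.Module, k: nn.Module, m: float = 0.999) -> None:
    """k <- m * k + (1 - m) * q, applied parameter-wise after each step."""
    for pq, pk in zip(q.parameters(), k.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)

momentum_update(encoder_q, encoder_k)
```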
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('mocov2_resnet50_8xb32-coslr-200e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py
```
Test:
```shell
python tools/test.py configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :-------------------------------------- | :--------: | :-------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: |
| `mocov2_resnet50_8xb32-coslr-200e_in1k` | 55.93 | 4.11 | [config](mocov2_resnet50_8xb32-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
| `resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k` | [MOCOV2](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.pth) | 25.56 | 4.11 | 67.50 | [config](benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.json) |
## Citation
```bibtex
@article{chen2020improved,
title={Improved baselines with momentum contrastive learning},
author={Chen, Xinlei and Fan, Haoqi and Girshick, Ross and He, Kaiming},
journal={arXiv preprint arXiv:2003.04297},
year={2020}
}
```
_base_ = [
'../../_base_/models/resnet50.py',
'../../_base_/datasets/imagenet_bs32_pil_resize.py',
'../../_base_/schedules/imagenet_sgd_steplr_100e.py',
'../../_base_/default_runtime.py',
]
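# linear evaluation: all four backbone stages are frozen and only the
# classification head is trained on top of the MoCoV2 features; the
# `checkpoint` field below is left empty and should point to the
# pretrained MoCoV2 weights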
model = dict(
backbone=dict(
frozen_stages=4,
init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
# optimizer
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='SGD', lr=30., momentum=0.9, weight_decay=0.))
# runtime settings
default_hooks = dict(
checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
Collections:
- Name: MoCoV2
Metadata:
Training Data: ImageNet-1k
Training Techniques:
- SGD with Momentum
- Weight Decay
Training Resources: 8x V100 GPUs
Architecture:
- ResNet
- MoCo
Paper:
Title: Improved Baselines with Momentum Contrastive Learning
URL: https://arxiv.org/abs/2003.04297
README: configs/mocov2/README.md
Models:
- Name: mocov2_resnet50_8xb32-coslr-200e_in1k
Metadata:
Epochs: 200
Batch Size: 256
FLOPs: 4109364224
Parameters: 55933312
Training Data: ImageNet-1k
In Collection: MoCoV2
Results: null
Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.pth
Config: configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py
Downstream:
- resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k
- Name: resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 256
FLOPs: 4109464576
Parameters: 25557032
Training Data: ImageNet-1k
In Collection: MoCoV2
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 67.5
Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.pth
Config: configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py