_base_ = [
    '../_base_/models/densenet/densenet121.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)
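# Illustration only (not part of the original config): with auto LR scaling
# enabled (typically via an `--auto-scale-lr` flag), MMEngine-style runners
# rescale the optimizer LR linearly with the effective batch size, roughly:
#     actual_batch_size = num_gpus * samples_per_gpu   # e.g. 4 x 256 = 1024
#     scaled_lr = base_lr * actual_batch_size / base_batch_size
# so a run on 8 GPUs x 256 samples (2048 total) would double the base LR.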

_base_ = [
    '../_base_/models/densenet/densenet161.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)

_base_ = [
    '../_base_/models/densenet/densenet169.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)

_base_ = [
    '../_base_/models/densenet/densenet201.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)

Collections:
  - Name: DenseNet
    Metadata:
      Training Data: ImageNet-1k
      Architecture:
        - DenseBlock
    Paper:
      URL: https://arxiv.org/abs/1608.06993
      Title: Densely Connected Convolutional Networks
    README: configs/densenet/README.md

Models:
  - Name: densenet121_3rdparty_in1k
    Metadata:
      FLOPs: 2881695488
      Parameters: 7978856
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 74.96
          Top 5 Accuracy: 92.21
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth
    Config: configs/densenet/densenet121_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet121-a639ec97.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
  - Name: densenet169_3rdparty_in1k
    Metadata:
      FLOPs: 3416860160
      Parameters: 14149480
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 76.08
          Top 5 Accuracy: 93.11
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet169_4xb256_in1k_20220426-a2889902.pth
    Config: configs/densenet/densenet169_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet169-b2777c0a.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
  - Name: densenet201_3rdparty_in1k
    Metadata:
      FLOPs: 4365236736
      Parameters: 20013928
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 77.32
          Top 5 Accuracy: 93.64
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet201_4xb256_in1k_20220426-05cae4ef.pth
    Config: configs/densenet/densenet201_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet201-c1103571.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
  - Name: densenet161_3rdparty_in1k
    Metadata:
      FLOPs: 7816363968
      Parameters: 28681000
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 77.61
          Top 5 Accuracy: 93.83
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet161_4xb256_in1k_20220426-ee6a80a9.pth
    Config: configs/densenet/densenet161_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet161-8d451a50.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
# DINOv2
> [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)
<!-- [ALGORITHM] -->
## Abstract
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/234560516-b495795c-c75c-444c-a712-bb61a3de444e.png" width="70%"/>
</div>
## How to use it?
<!-- [TABS-BEGIN] -->
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('vit-small-p14_dinov2-pre_3rdparty', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :------------------------------------ | :--------: | :-------: | :--------------------------------------------: | :------------------------------------------------------------------------------------------------: |
| `vit-small-p14_dinov2-pre_3rdparty`\* | 22.06 | 46.76 | [config](vit-small-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth) |
| `vit-base-p14_dinov2-pre_3rdparty`\* | 86.58 | 152.00 | [config](vit-base-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth) |
| `vit-large-p14_dinov2-pre_3rdparty`\* | 304.00 | 507.00 | [config](vit-large-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth) |
| `vit-giant-p14_dinov2-pre_3rdparty`\* | 1136.00 | 1784.00 | [config](vit-giant-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth) |
*Models with * are converted from the [official repo](https://github.com/facebookresearch/dinov2). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@misc{oquab2023dinov2,
title={DINOv2: Learning Robust Visual Features without Supervision},
author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
journal={arXiv:2304.07193},
year={2023}
}
```

Collections:
  - Name: DINOv2
    Metadata:
      Architecture:
        - Dropout
        - GELU
        - Layer Normalization
        - Multi-Head Attention
        - Scaled Dot-Product Attention
    Paper:
      Title: 'DINOv2: Learning Robust Visual Features without Supervision'
      URL: https://arxiv.org/abs/2304.07193
    README: configs/dinov2/README.md
    Code:
      URL: null
      Version: null

Models:
  - Name: vit-small-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 46762000000
      Parameters: 22056000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth
    Config: configs/dinov2/vit-small-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
  - Name: vit-base-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 152000000000
      Parameters: 86580000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth
    Config: configs/dinov2/vit-base-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
  - Name: vit-large-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 507000000000
      Parameters: 304000000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth
    Config: configs/dinov2/vit-large-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
  - Name: vit-giant-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 1784000000000
      Parameters: 1136000000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth
    Config: configs/dinov2/vit-giant-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2

# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='base',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
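# Rough illustration (not part of the original config) of what the
# preprocessor above does, assuming a BGR uint8 input image:
#     img_rgb = img_bgr[..., ::-1]          # to_rgb=True
#     out = (img_rgb - mean) / std          # per-channel normalization
# e.g. a red value of 124 maps to (124 - 123.675) / 58.395 ≈ 0.0056.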

# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='dinov2-giant',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
        layer_cfgs=dict(ffn_type='swiglu_fused'),
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
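# Note (illustration, not from the original config): `ffn_type='swiglu_fused'`
# selects a SwiGLU feed-forward block instead of the usual GELU MLP used by the
# other variants. In simplified form:
#     ffn(x) = W3 @ (silu(W1 @ x) * (W2 @ x))
# i.e. a SiLU-gated linear unit followed by an output projection.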

# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='large',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)

# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='dinov2-small',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
# EdgeNeXt
> [EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications](https://arxiv.org/abs/2206.10589)
<!-- [ALGORITHM] -->
## Abstract
In the pursuit of achieving ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient general-purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture, EdgeNeXt. Specifically, in EdgeNeXt we introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and utilizes depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection and segmentation tasks reveal the merits of the proposed approach, outperforming state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% and a 28% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
<div align=center>
<img src="https://github.com/mmaaz60/EdgeNeXt/raw/main/images/EdgeNext.png" width="100%"/>
</div>
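To make the SDTA encoder described in the abstract a bit more concrete, below is a rough PyTorch sketch of its two ingredients: a Res2Net-style channel split processed by depth-wise convolutions, followed by attention computed across the channel dimension ("transposed" attention). This is our own simplified illustration with assumed shapes and hyperparameters, not the official block; see the [official repo](https://github.com/mmaaz60/EdgeNeXt) or the mmpretrain backbone implementation for the real code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SDTASketch(nn.Module):
    """Simplified sketch of a split depth-wise transpose attention (SDTA) block."""

    def __init__(self, dim: int = 64, splits: int = 4, heads: int = 4):
        super().__init__()
        group = dim // splits
        # One 3x3 depth-wise conv per channel group (multi-scale receptive field).
        self.dwconvs = nn.ModuleList([
            nn.Conv2d(group, group, 3, padding=1, groups=group)
            for _ in range(splits - 1)
        ])
        self.splits = splits
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        B, C, H, W = x.shape
        chunks = list(x.chunk(self.splits, dim=1))
        outs, prev = [chunks[0]], chunks[0]
        for conv, ci in zip(self.dwconvs, chunks[1:]):
            prev = conv(ci + prev)  # hierarchical (Res2Net-style) fusion of groups
            outs.append(prev)
        tokens = torch.cat(outs, dim=1).flatten(2).transpose(1, 2)  # (B, N, C)

        # "Transpose" attention: the attention map is (C/heads x C/heads), so its
        # cost grows with the number of channels, not with the spatial tokens.
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        q = q.transpose(1, 2).reshape(B, self.heads, C // self.heads, -1)
        k = k.transpose(1, 2).reshape(B, self.heads, C // self.heads, -1)
        v = v.transpose(1, 2).reshape(B, self.heads, C // self.heads, -1)
        attn = (F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)).softmax(-1)
        out = (attn @ v).reshape(B, C, -1).transpose(1, 2)  # back to (B, N, C)
        return self.proj(out).transpose(1, 2).reshape(B, C, H, W)


# quick shape check
print(SDTASketch()(torch.rand(1, 64, 14, 14)).shape)  # torch.Size([1, 64, 14, 14])
```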
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('edgenext-xxsmall_3rdparty_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('edgenext-xxsmall_3rdparty_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Test:
```shell
python tools/test.py configs/edgenext/edgenext-xxsmall_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :----------------------------------------------------------------------: |
| `edgenext-xxsmall_3rdparty_in1k`\* | From scratch | 1.33 | 0.26 | 71.20 | 89.91 | [config](edgenext-xxsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth) |
| `edgenext-xsmall_3rdparty_in1k`\* | From scratch | 2.34 | 0.53 | 74.86 | 92.31 | [config](edgenext-xsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xsmall_3rdparty_in1k_20220801-974f9fe7.pth) |
| `edgenext-small_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 79.41 | 94.53 | [config](edgenext-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty_in1k_20220801-d00db5f8.pth) |
| `edgenext-small-usi_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 81.06 | 95.34 | [config](edgenext-small_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty-usi_in1k_20220801-ae6d8dd3.pth) |
| `edgenext-base_3rdparty_in1k`\* | From scratch | 18.51 | 3.81 | 82.48 | 96.20 | [config](edgenext-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty_in1k_20220801-9ade408b.pth) |
| `edgenext-base_3rdparty-usi_in1k`\* | From scratch | 18.51 | 3.81 | 83.67 | 96.70 | [config](edgenext-base_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty-usi_in1k_20220801-909e8939.pth) |
*Models with * are converted from the [official repo](https://github.com/mmaaz60/EdgeNeXt). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@article{Maaz2022EdgeNeXt,
title={EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications},
author={Muhammad Maaz and Abdelrahman Shaker and Hisham Cholakkal and Salman Khan and Syed Waqas Zamir and Rao Muhammad Anwer and Fahad Shahbaz Khan},
journal={arXiv:2206.10589},
year={2022}
}
```
_base_ = ['./edgenext-base_8xb256_in1k.py']
# dataset setting
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='ResizeEdge',
        scale=269,
        edge='short',
        backend='pillow',
        interpolation='bicubic'),
    dict(type='CenterCrop', crop_size=256),
    dict(type='PackInputs')
]
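# Illustration (not part of the original config): for a 400x500 (HxW) input,
# ResizeEdge(scale=269, edge='short') rescales the short side to 269, giving
# roughly 269x336, and CenterCrop(crop_size=256) then feeds the central
# 256x256 region to the network.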
val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader

_base_ = [
    '../_base_/models/edgenext/edgenext-base.py',
    '../_base_/datasets/imagenet_bs64_edgenext_256.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

# schedule setting
optim_wrapper = dict(
    optimizer=dict(lr=6e-3),
    clip_grad=dict(max_norm=5.0),
)
# runtime setting
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (32 GPUs) x (128 samples per GPU)
auto_scale_lr = dict(base_batch_size=4096)
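# Illustration only (assuming MMEngine's ExponentialMovingAverage semantics):
# with momentum=4e-5 the EMAHook above updates its weights each iteration as
#     ema_param = (1 - momentum) * ema_param + momentum * param
# i.e. a very slowly moving average of the trained weights used at evaluation.
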
_base_ = ['./edgenext-small_8xb256_in1k.py']
# dataset setting
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='ResizeEdge',
        scale=269,
        edge='short',
        backend='pillow',
        interpolation='bicubic'),
    dict(type='CenterCrop', crop_size=256),
    dict(type='PackInputs')
]
val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader

_base_ = [
    '../_base_/models/edgenext/edgenext-small.py',
    '../_base_/datasets/imagenet_bs64_edgenext_256.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

# schedule setting
optim_wrapper = dict(
    optimizer=dict(lr=6e-3),
    clip_grad=dict(max_norm=5.0),
)
# runtime setting
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (32 GPUs) x (128 samples per GPU)
auto_scale_lr = dict(base_batch_size=4096)

_base_ = [
    '../_base_/models/edgenext/edgenext-xsmall.py',
    '../_base_/datasets/imagenet_bs64_edgenext_256.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

# schedule setting
optim_wrapper = dict(
    optimizer=dict(lr=6e-3),
    clip_grad=dict(max_norm=5.0),
)
# runtime setting
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (32 GPUs) x (128 samples per GPU)
auto_scale_lr = dict(base_batch_size=4096)

_base_ = [
    '../_base_/models/edgenext/edgenext-xxsmall.py',
    '../_base_/datasets/imagenet_bs64_edgenext_256.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

# schedule setting
optim_wrapper = dict(
    optimizer=dict(lr=6e-3),
    clip_grad=dict(max_norm=5.0),
)
# runtime setting
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (32 GPUs) x (128 samples per GPU)
auto_scale_lr = dict(base_batch_size=4096)

Collections:
  - Name: EdgeNeXt
    Metadata:
      Training Data: ImageNet-1k
      Architecture:
        - SDTA
        - 1x1 Convolution
        - Channel Self-attention
    Paper:
      URL: https://arxiv.org/abs/2206.10589
      Title: 'EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications'
    README: configs/edgenext/README.md
    Code:
      Version: v1.0.0rc1
      URL: https://github.com/open-mmlab/mmpretrain/blob/v0.23.2/mmcls/models/backbones/edgenext.py

Models:
  - Name: edgenext-xxsmall_3rdparty_in1k
    Metadata:
      FLOPs: 255640144
      Parameters: 1327216
    In Collection: EdgeNeXt
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 71.20
          Top 5 Accuracy: 89.91
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth
    Config: configs/edgenext/edgenext-xxsmall_8xb256_in1k.py
    Converted From:
      Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_xxsmall.pth
      Code: https://github.com/mmaaz60/EdgeNeXt
  - Name: edgenext-xsmall_3rdparty_in1k
    Metadata:
      Training Data: ImageNet-1k
      FLOPs: 529970560
      Parameters: 2336804
    In Collection: EdgeNeXt
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 74.86
          Top 5 Accuracy: 92.31
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xsmall_3rdparty_in1k_20220801-974f9fe7.pth
    Config: configs/edgenext/edgenext-xsmall_8xb256_in1k.py
    Converted From:
      Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_xsmall.pth
      Code: https://github.com/mmaaz60/EdgeNeXt
  - Name: edgenext-small_3rdparty_in1k
    Metadata:
      Training Data: ImageNet-1k
      FLOPs: 1249788000
      Parameters: 5586832
    In Collection: EdgeNeXt
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 79.41
          Top 5 Accuracy: 94.53
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty_in1k_20220801-d00db5f8.pth
    Config: configs/edgenext/edgenext-small_8xb256_in1k.py
    Converted From:
      Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_small.pth
      Code: https://github.com/mmaaz60/EdgeNeXt
  - Name: edgenext-small-usi_3rdparty_in1k
    Metadata:
      Training Data: ImageNet-1k
      FLOPs: 1249788000
      Parameters: 5586832
    In Collection: EdgeNeXt
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 81.06
          Top 5 Accuracy: 95.34
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty-usi_in1k_20220801-ae6d8dd3.pth
    Config: configs/edgenext/edgenext-small_8xb256-usi_in1k.py
    Converted From:
      Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.1/edgenext_small_usi.pth
      Code: https://github.com/mmaaz60/EdgeNeXt
  - Name: edgenext-base_3rdparty_in1k
    Metadata:
      Training Data: ImageNet-1k
      FLOPs: 3814395280
      Parameters: 18511292
    In Collection: EdgeNeXt
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 82.48
          Top 5 Accuracy: 96.2
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty_in1k_20220801-9ade408b.pth
    Config: configs/edgenext/edgenext-base_8xb256_in1k.py
    Converted From:
      Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.2/edgenext_base.pth
      Code: https://github.com/mmaaz60/EdgeNeXt
  - Name: edgenext-base_3rdparty-usi_in1k
    Metadata:
      Training Data: ImageNet-1k
      FLOPs: 3814395280
      Parameters: 18511292
    In Collection: EdgeNeXt
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 83.67
          Top 5 Accuracy: 96.7
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty-usi_in1k_20220801-909e8939.pth
    Config: configs/edgenext/edgenext-base_8xb256-usi_in1k.py
    Converted From:
      Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.2/edgenext_base_usi.pth
      Code: https://github.com/mmaaz60/EdgeNeXt
# EfficientFormer
> [EfficientFormer: Vision Transformers at MobileNet Speed](https://arxiv.org/abs/2206.01191)
<!-- [ALGORITHM] -->
## Abstract
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., the attention mechanism, ViT-based models are generally many times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.
<div align=center>
<img src="https://user-images.githubusercontent.com/18586273/180713426-9d3d77e3-3584-42d8-9098-625b4170d796.png" width="100%"/>
</div>
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('efficientformer-l1_3rdparty_8xb128_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('efficientformer-l1_3rdparty_8xb128_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Test:
```shell
python tools/test.py configs/efficientformer/efficientformer-l1_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :---------------------------------------------------------------: |
| `efficientformer-l1_3rdparty_8xb128_in1k`\* | From scratch | 12.28 | 1.30 | 80.46 | 94.99 | [config](efficientformer-l1_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth) |
| `efficientformer-l3_3rdparty_8xb128_in1k`\* | From scratch | 31.41 | 3.74 | 82.45 | 96.18 | [config](efficientformer-l3_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l3_3rdparty_in1k_20220915-466793d6.pth) |
| `efficientformer-l7_3rdparty_8xb128_in1k`\* | From scratch | 82.23 | 10.16 | 83.40 | 96.60 | [config](efficientformer-l7_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l7_3rdparty_in1k_20220915-185e30af.pth) |
*Models with * are converted from the [official repo](https://github.com/snap-research/EfficientFormer). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@misc{https://doi.org/10.48550/arxiv.2206.01191,
doi = {10.48550/ARXIV.2206.01191},
url = {https://arxiv.org/abs/2206.01191},
author = {Li, Yanyu and Yuan, Geng and Wen, Yang and Hu, Eric and Evangelidis, Georgios and Tulyakov, Sergey and Wang, Yanzhi and Ren, Jian},
keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {EfficientFormer: Vision Transformers at MobileNet Speed},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
```