# DenseCL
> [Dense contrastive learning for self-supervised visual pre-training](https://arxiv.org/abs/2011.09157)
<!-- [ALGORITHM] -->
## Abstract
To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/149721111-bab03a6d-a30d-418e-b338-43c3689cfc65.png" width="900" />
</div>
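DenseCL keeps the usual image-level InfoNCE loss and adds a dense counterpart with one InfoNCE term per spatial location. The snippet below is a minimal sketch of that dense term, assuming the local features of the two views are l2-normalized and already matched so that position `i` in `q_grid` corresponds to position `i` in `k_grid` (the correspondence extraction described in the paper is omitted); `dense_info_nce` and the tensor shapes are illustrative, not the actual mmpretrain implementation.

```python
import torch
import torch.nn.functional as F


def dense_info_nce(q_grid, k_grid, queue, temperature=0.2):
    """Pixel-level InfoNCE sketch.

    q_grid, k_grid: (N, C, S) l2-normalized local features of two views,
        matched so that q_grid[:, :, i] and k_grid[:, :, i] correspond.
    queue: (C, K) memory bank of negative local features.
    """
    n, c, s = q_grid.shape
    q = q_grid.permute(0, 2, 1).reshape(n * s, c)  # one query per location
    k = k_grid.permute(0, 2, 1).reshape(n * s, c)
    l_pos = (q * k).sum(dim=1, keepdim=True)       # (N*S, 1) positive logits
    l_neg = q @ queue                              # (N*S, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(n * s, dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)         # positives sit at index 0
```

In the full model this dense term is combined with the global InfoNCE loss, weighted by `loss_lambda` (0.5 in the `densecl_resnet50_8xb32-coslr-200e_in1k` config); the temperature of 0.2 matches that config as well.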
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('densecl_resnet50_8xb32-coslr-200e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
```
Test:
```shell
python tools/test.py configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :--------------------------------------- | :--------: | :-------: | :-------------------------------------------------: | :----------------------------------------------------------------------------------------: |
| `densecl_resnet50_8xb32-coslr-200e_in1k` | 64.85 | 4.11 | [config](densecl_resnet50_8xb32-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
| `resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k` | [DENSECL](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth) | 25.56 | 4.11 | 63.50 | [config](benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.json) |
## Citation
```bibtex
@inproceedings{wang2021dense,
  title={Dense contrastive learning for self-supervised visual pre-training},
  author={Wang, Xinlong and Zhang, Rufeng and Shen, Chunhua and Kong, Tao and Li, Lei},
  booktitle={CVPR},
  year={2021}
}
```
# configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
_base_ = [
    '../../_base_/models/resnet50.py',
    '../../_base_/datasets/imagenet_bs32_pil_resize.py',
    '../../_base_/schedules/imagenet_sgd_steplr_100e.py',
    '../../_base_/default_runtime.py',
]

# linear evaluation: freeze the whole backbone (stem and all 4 stages)
model = dict(
    backbone=dict(
        frozen_stages=4,
        init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))

# optimizer
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=30., momentum=0.9, weight_decay=0.))

# runtime settings
default_hooks = dict(
    checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
# configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
_base_ = [
    '../_base_/datasets/imagenet_bs32_mocov2.py',
    '../_base_/schedules/imagenet_sgd_coslr_200e.py',
    '../_base_/default_runtime.py',
]

# model settings
model = dict(
    type='DenseCL',
    queue_len=65536,  # length of the queue of negative features
    feat_dim=128,  # dimension of the projected contrastive features
    momentum=0.001,  # momentum coefficient of the key-encoder EMA update
    loss_lambda=0.5,  # weight between the global and the dense loss terms
    backbone=dict(
        type='ResNet',
        depth=50,
        norm_cfg=dict(type='BN'),
        zero_init_residual=False),
    neck=dict(
        type='DenseCLNeck',
        in_channels=2048,
        hid_channels=2048,
        out_channels=128,
        num_grid=None),
    head=dict(
        type='ContrastiveHead',
        loss=dict(type='CrossEntropyLoss'),
        temperature=0.2),
)
find_unused_parameters = True

# runtime settings
default_hooks = dict(
    # only keeps the latest 3 checkpoints
    checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))

# NOTE: `auto_scale_lr` is for automatically scaling LR based on the actual
# training batch size, i.e. lr = base_lr * actual_batch_size / base_batch_size.
auto_scale_lr = dict(base_batch_size=256)
Collections:
  - Name: DenseCL
    Metadata:
      Training Data: ImageNet-1k
      Training Techniques:
        - SGD with Momentum
        - Weight Decay
      Training Resources: 8x V100 GPUs
      Architecture:
        - ResNet
    Paper:
      Title: Dense contrastive learning for self-supervised visual pre-training
      URL: https://arxiv.org/abs/2011.09157
    README: configs/densecl/README.md

Models:
  - Name: densecl_resnet50_8xb32-coslr-200e_in1k
    Metadata:
      Epochs: 200
      Batch Size: 256
      FLOPs: 4109364224
      Parameters: 64850560
      Training Data: ImageNet-1k
    In Collection: DenseCL
    Results: null
    Weights: https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth
    Config: configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
    Downstream:
      - resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k
  - Name: resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k
    Metadata:
      Epochs: 100
      Batch Size: 256
      FLOPs: 4109464576
      Parameters: 25557032
      Training Data: ImageNet-1k
    In Collection: DenseCL
    Results:
      - Task: Image Classification
        Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 63.5
    Weights: https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth
    Config: configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
# DenseNet
> [Densely Connected Convolutional Networks](https://arxiv.org/abs/1608.06993)
<!-- [ALGORITHM] -->
## Abstract
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance.
<div align=center>
<img src="https://user-images.githubusercontent.com/42952108/162675098-9a670883-b13a-4a5a-a9c9-06c39c616a0a.png" width="100%"/>
</div>
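The dense connectivity the abstract describes is straightforward to write down. Below is a minimal sketch of a dense block, assuming plain 3x3 convolutions with a fixed `growth_rate` (the paper's k); the 1x1 bottleneck layers, transition layers, and memory optimizations of the torchvision implementation referenced in the table are omitted, and the class names are illustrative.

```python
import torch
import torch.nn as nn


class DenseLayer(nn.Module):
    """BN-ReLU-Conv producing `growth_rate` new feature maps."""

    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, 3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(torch.relu(self.norm(x)))


class DenseBlock(nn.Module):
    """Layer i receives the concatenation of the block input and the outputs
    of all earlier layers, giving the L(L+1)/2 direct connections."""

    def __init__(self, num_layers, in_channels, growth_rate=32):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)


block = DenseBlock(num_layers=4, in_channels=64)
print(block(torch.rand(1, 64, 32, 32)).shape)  # (1, 64 + 4 * 32, 32, 32)
```

Because each layer only contributes `growth_rate` channels, the network stays narrow even though the connectivity grows quadratically with depth, which is where the parameter savings come from.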
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('densenet121_3rdparty_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('densenet121_3rdparty_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Test:
```shell
python tools/test.py configs/densenet/densenet121_4xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :------------------------------------------------------------------------------------: |
| `densenet121_3rdparty_in1k`\* | From scratch | 7.98 | 2.88 | 74.96 | 92.21 | [config](densenet121_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth) |
| `densenet169_3rdparty_in1k`\* | From scratch | 14.15 | 3.42 | 76.08 | 93.11 | [config](densenet169_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet169_4xb256_in1k_20220426-a2889902.pth) |
| `densenet201_3rdparty_in1k`\* | From scratch | 20.01 | 4.37 | 77.32 | 93.64 | [config](densenet201_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet201_4xb256_in1k_20220426-05cae4ef.pth) |
| `densenet161_3rdparty_in1k`\* | From scratch | 28.68 | 7.82 | 77.61 | 93.83 | [config](densenet161_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet161_4xb256_in1k_20220426-ee6a80a9.pth) |
*Models with * are converted from the [official repo](https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@misc{https://doi.org/10.48550/arxiv.1608.06993,
  doi = {10.48550/ARXIV.1608.06993},
  url = {https://arxiv.org/abs/1608.06993},
  author = {Huang, Gao and Liu, Zhuang and van der Maaten, Laurens and Weinberger, Kilian Q.},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {Densely Connected Convolutional Networks},
  publisher = {arXiv},
  year = {2016},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
# configs/densenet/densenet121_4xb256_in1k.py
_base_ = [
    '../_base_/models/densenet/densenet121.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)
# configs/densenet/densenet161_4xb256_in1k.py
_base_ = [
    '../_base_/models/densenet/densenet161.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)
# configs/densenet/densenet169_4xb256_in1k.py
_base_ = [
    '../_base_/models/densenet/densenet169.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)
# configs/densenet/densenet201_4xb256_in1k.py
_base_ = [
    '../_base_/models/densenet/densenet201.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)
Collections:
  - Name: DenseNet
    Metadata:
      Training Data: ImageNet-1k
      Architecture:
        - DenseBlock
    Paper:
      URL: https://arxiv.org/abs/1608.06993
      Title: Densely Connected Convolutional Networks
    README: configs/densenet/README.md

Models:
  - Name: densenet121_3rdparty_in1k
    Metadata:
      FLOPs: 2881695488
      Parameters: 7978856
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 74.96
          Top 5 Accuracy: 92.21
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth
    Config: configs/densenet/densenet121_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet121-a639ec97.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
  - Name: densenet169_3rdparty_in1k
    Metadata:
      FLOPs: 3416860160
      Parameters: 14149480
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 76.08
          Top 5 Accuracy: 93.11
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet169_4xb256_in1k_20220426-a2889902.pth
    Config: configs/densenet/densenet169_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet169-b2777c0a.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
  - Name: densenet201_3rdparty_in1k
    Metadata:
      FLOPs: 4365236736
      Parameters: 20013928
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 77.32
          Top 5 Accuracy: 93.64
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet201_4xb256_in1k_20220426-05cae4ef.pth
    Config: configs/densenet/densenet201_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet201-c1103571.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
  - Name: densenet161_3rdparty_in1k
    Metadata:
      FLOPs: 7816363968
      Parameters: 28681000
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 77.61
          Top 5 Accuracy: 93.83
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet161_4xb256_in1k_20220426-ee6a80a9.pth
    Config: configs/densenet/densenet161_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet161-8d451a50.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
# DINOv2
> [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)
<!-- [ALGORITHM] -->
## Abstract
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without fine-tuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/234560516-b495795c-c75c-444c-a712-bb61a3de444e.png" width="70%"/>
</div>
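Most of the DINOv2 recipe concerns data curation and training stability; the step that produces the smaller released models is distillation from the frozen ViT-g teacher. The sketch below shows a generic feature-distillation objective of that flavor, purely as an illustration: the actual DINOv2 loss distills through the DINO/iBOT projection heads rather than matching raw backbone features, and `distill_loss` is a hypothetical name.

```python
import torch
import torch.nn.functional as F


def distill_loss(student_feats, teacher_feats):
    """Push student embeddings toward the frozen teacher's embeddings
    of the same image (cosine-similarity form)."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats.detach(), dim=-1)  # no gradient to teacher
    return 1 - (s * t).sum(dim=-1).mean()
```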
## How to use it?
<!-- [TABS-BEGIN] -->
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('vit-small-p14_dinov2-pre_3rdparty', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :------------------------------------ | :--------: | :-------: | :--------------------------------------------: | :------------------------------------------------------------------------------------------------: |
| `vit-small-p14_dinov2-pre_3rdparty`\* | 22.06 | 46.76 | [config](vit-small-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth) |
| `vit-base-p14_dinov2-pre_3rdparty`\* | 86.58 | 152.00 | [config](vit-base-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth) |
| `vit-large-p14_dinov2-pre_3rdparty`\* | 304.00 | 507.00 | [config](vit-large-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth) |
| `vit-giant-p14_dinov2-pre_3rdparty`\* | 1136.00 | 1784.00 | [config](vit-giant-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth) |
*Models with * are converted from the [official repo](https://github.com/facebookresearch/dinov2). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@misc{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
  journal={arXiv:2304.07193},
  year={2023}
}
```
Collections:
  - Name: DINOv2
    Metadata:
      Architecture:
        - Dropout
        - GELU
        - Layer Normalization
        - Multi-Head Attention
        - Scaled Dot-Product Attention
    Paper:
      Title: 'DINOv2: Learning Robust Visual Features without Supervision'
      URL: https://arxiv.org/abs/2304.07193
    README: configs/dinov2/README.md
    Code:
      URL: null
      Version: null

Models:
  - Name: vit-small-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 46762000000
      Parameters: 22056000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth
    Config: configs/dinov2/vit-small-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
  - Name: vit-base-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 152000000000
      Parameters: 86580000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth
    Config: configs/dinov2/vit-base-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
  - Name: vit-large-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 507000000000
      Parameters: 304000000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth
    Config: configs/dinov2/vit-large-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
  - Name: vit-giant-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 1784000000000
      Parameters: 1136000000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth
    Config: configs/dinov2/vit-giant-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
# configs/dinov2/vit-base-p14_dinov2-pre_headless.py
# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='base',
        img_size=518,
        patch_size=14,  # 518 / 14 = 37 patches per side
        layer_scale_init_value=1e-5,
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
# configs/dinov2/vit-giant-p14_dinov2-pre_headless.py
# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='dinov2-giant',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
        layer_cfgs=dict(ffn_type='swiglu_fused'),
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
# configs/dinov2/vit-large-p14_dinov2-pre_headless.py
# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='large',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
# configs/dinov2/vit-small-p14_dinov2-pre_headless.py
# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='dinov2-small',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
# EdgeNeXt
> [EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications](https://arxiv.org/abs/2206.10589)
<!-- [ALGORITHM] -->
## Abstract
In the pursuit of achieving ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient, general-purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture, EdgeNeXt. Specifically, in EdgeNeXt we introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and utilizes depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection and segmentation tasks reveal the merits of the proposed approach, outperforming state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% and a 28% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
<div align=center>
<img src="https://github.com/mmaaz60/EdgeNeXt/raw/main/images/EdgeNext.png" width="100%"/>
</div>
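The SDTA encoder combines the two ingredients the abstract names: a cascade of depth-wise convolutions over channel splits (to grow the receptive field) and self-attention computed across the channel dimension rather than the spatial one (so its cost scales with channels instead of resolution). The sketch below illustrates both ideas under simplifying assumptions; it omits the pointwise convolutions, positional encoding, and learned query/key/value projections of the real EdgeNeXt block, and `SDTASketch` is a hypothetical name.

```python
import torch
import torch.nn as nn


class SDTASketch(nn.Module):
    """Channel-split depth-wise conv cascade + transposed self-attention."""

    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        split = channels // groups
        self.groups = groups
        self.dwconvs = nn.ModuleList(
            nn.Conv2d(split, split, 3, padding=1, groups=split)
            for _ in range(groups))

    def forward(self, x):
        # Cascade over channel splits: each split also sees the previous
        # split's output, implicitly enlarging the receptive field.
        outs, prev = [], 0
        for conv, chunk in zip(self.dwconvs, torch.chunk(x, self.groups, dim=1)):
            prev = conv(chunk + prev)
            outs.append(prev)
        y = torch.cat(outs, dim=1)

        # Transposed attention: the attention map is C x C over channels,
        # not (H*W) x (H*W) over spatial positions.
        n, c, h, w = y.shape
        tokens = y.flatten(2)                                  # (N, C, H*W)
        attn = torch.softmax(
            tokens @ tokens.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
        y = (attn @ tokens).reshape(n, c, h, w)
        return y + x                                           # residual


print(SDTASketch(64)(torch.rand(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)
```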
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('edgenext-xxsmall_3rdparty_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('edgenext-xxsmall_3rdparty_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Test:
```shell
python tools/test.py configs/edgenext/edgenext-xxsmall_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :----------------------------------------------------------------------: |
| `edgenext-xxsmall_3rdparty_in1k`\* | From scratch | 1.33 | 0.26 | 71.20 | 89.91 | [config](edgenext-xxsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth) |
| `edgenext-xsmall_3rdparty_in1k`\* | From scratch | 2.34 | 0.53 | 74.86 | 92.31 | [config](edgenext-xsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xsmall_3rdparty_in1k_20220801-974f9fe7.pth) |
| `edgenext-small_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 79.41 | 94.53 | [config](edgenext-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty_in1k_20220801-d00db5f8.pth) |
| `edgenext-small-usi_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 81.06 | 95.34 | [config](edgenext-small_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty-usi_in1k_20220801-ae6d8dd3.pth) |
| `edgenext-base_3rdparty_in1k`\* | From scratch | 18.51 | 3.81 | 82.48 | 96.20 | [config](edgenext-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty_in1k_20220801-9ade408b.pth) |
| `edgenext-base_3rdparty-usi_in1k`\* | From scratch | 18.51 | 3.81 | 83.67 | 96.70 | [config](edgenext-base_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty-usi_in1k_20220801-909e8939.pth) |
*Models with * are converted from the [official repo](https://github.com/mmaaz60/EdgeNeXt). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@article{Maaz2022EdgeNeXt,
  title={EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications},
  author={Muhammad Maaz and Abdelrahman Shaker and Hisham Cholakkal and Salman Khan and Syed Waqas Zamir and Rao Muhammad Anwer and Fahad Shahbaz Khan},
  journal={arXiv preprint arXiv:2206.10589},
  year={2022}
}
```
# configs/edgenext/edgenext-base_8xb256-usi_in1k.py
_base_ = ['./edgenext-base_8xb256_in1k.py']

# dataset setting
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='ResizeEdge',
        scale=269,
        edge='short',
        backend='pillow',
        interpolation='bicubic'),
    dict(type='CenterCrop', crop_size=256),
    dict(type='PackInputs')
]

val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
# configs/edgenext/edgenext-base_8xb256_in1k.py
_base_ = [
    '../_base_/models/edgenext/edgenext-base.py',
    '../_base_/datasets/imagenet_bs64_edgenext_256.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

# schedule setting
optim_wrapper = dict(
    optimizer=dict(lr=6e-3),
    clip_grad=dict(max_norm=5.0),
)
# runtime setting
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (32 GPUs) x (128 samples per GPU)
auto_scale_lr = dict(base_batch_size=4096)
# configs/edgenext/edgenext-small_8xb256-usi_in1k.py
_base_ = ['./edgenext-small_8xb256_in1k.py']

# dataset setting
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='ResizeEdge',
        scale=269,
        edge='short',
        backend='pillow',
        interpolation='bicubic'),
    dict(type='CenterCrop', crop_size=256),
    dict(type='PackInputs')
]

val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader