# DenseCL
> [Dense contrastive learning for self-supervised visual pre-training](https://arxiv.org/abs/2011.09157)
<!-- [ALGORITHM] -->
## Abstract
To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/149721111-bab03a6d-a30d-418e-b338-43c3689cfc65.png" width="900" />
</div>
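DenseCL keeps the usual image-level InfoNCE loss and adds a dense counterpart with one InfoNCE term per spatial location. The snippet below is a minimal sketch of that dense term, assuming the local features of the two views are l2-normalized and already matched so that position `i` in `q_grid` corresponds to position `i` in `k_grid` (the correspondence extraction described in the paper is omitted); `dense_info_nce` and the tensor shapes are illustrative, not the actual mmpretrain implementation.

```python
import torch
import torch.nn.functional as F


def dense_info_nce(q_grid, k_grid, queue, temperature=0.2):
    """Pixel-level InfoNCE sketch.

    q_grid, k_grid: (N, C, S) l2-normalized local features of two views,
        matched so that q_grid[:, :, i] and k_grid[:, :, i] correspond.
    queue: (C, K) memory bank of negative local features.
    """
    n, c, s = q_grid.shape
    q = q_grid.permute(0, 2, 1).reshape(n * s, c)  # one query per location
    k = k_grid.permute(0, 2, 1).reshape(n * s, c)
    l_pos = (q * k).sum(dim=1, keepdim=True)       # (N*S, 1) positive logits
    l_neg = q @ queue                              # (N*S, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(n * s, dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)         # positives sit at index 0
```

In the full model this dense term is combined with the global InfoNCE loss, weighted by `loss_lambda` (0.5 in the `densecl_resnet50_8xb32-coslr-200e_in1k` config); the temperature of 0.2 matches that config as well.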
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('densecl_resnet50_8xb32-coslr-200e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
```
Test:
```shell
python tools/test.py configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :--------------------------------------- | :--------: | :-------: | :-------------------------------------------------: | :----------------------------------------------------------------------------------------: |
| `densecl_resnet50_8xb32-coslr-200e_in1k` | 64.85 | 4.11 | [config](densecl_resnet50_8xb32-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
| `resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k` | [DENSECL](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth) | 25.56 | 4.11 | 63.50 | [config](benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.json) |
## Citation
```bibtex
@inproceedings{wang2021dense,
  title={Dense contrastive learning for self-supervised visual pre-training},
  author={Wang, Xinlong and Zhang, Rufeng and Shen, Chunhua and Kong, Tao and Li, Lei},
  booktitle={CVPR},
  year={2021}
}
```
# configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
_base_ = [
    '../../_base_/models/resnet50.py',
    '../../_base_/datasets/imagenet_bs32_pil_resize.py',
    '../../_base_/schedules/imagenet_sgd_steplr_100e.py',
    '../../_base_/default_runtime.py',
]

# linear evaluation: freeze the whole backbone (stem and all 4 stages)
model = dict(
    backbone=dict(
        frozen_stages=4,
        init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))

# optimizer
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=30., momentum=0.9, weight_decay=0.))

# runtime settings
default_hooks = dict(
    checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
# configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
_base_ = [
    '../_base_/datasets/imagenet_bs32_mocov2.py',
    '../_base_/schedules/imagenet_sgd_coslr_200e.py',
    '../_base_/default_runtime.py',
]

# model settings
model = dict(
    type='DenseCL',
    queue_len=65536,  # length of the queue of negative features
    feat_dim=128,  # dimension of the projected contrastive features
    momentum=0.001,  # momentum coefficient of the key-encoder EMA update
    loss_lambda=0.5,  # weight between the global and the dense loss terms
    backbone=dict(
        type='ResNet',
        depth=50,
        norm_cfg=dict(type='BN'),
        zero_init_residual=False),
    neck=dict(
        type='DenseCLNeck',
        in_channels=2048,
        hid_channels=2048,
        out_channels=128,
        num_grid=None),
    head=dict(
        type='ContrastiveHead',
        loss=dict(type='CrossEntropyLoss'),
        temperature=0.2),
)
find_unused_parameters = True

# runtime settings
default_hooks = dict(
    # only keeps the latest 3 checkpoints
    checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))

# NOTE: `auto_scale_lr` is for automatically scaling LR based on the actual
# training batch size, i.e. lr = base_lr * actual_batch_size / base_batch_size.
auto_scale_lr = dict(base_batch_size=256)
Collections:
  - Name: DenseCL
    Metadata:
      Training Data: ImageNet-1k
      Training Techniques:
        - SGD with Momentum
        - Weight Decay
      Training Resources: 8x V100 GPUs
      Architecture:
        - ResNet
    Paper:
      Title: Dense contrastive learning for self-supervised visual pre-training
      URL: https://arxiv.org/abs/2011.09157
    README: configs/densecl/README.md

Models:
  - Name: densecl_resnet50_8xb32-coslr-200e_in1k
    Metadata:
      Epochs: 200
      Batch Size: 256
      FLOPs: 4109364224
      Parameters: 64850560
      Training Data: ImageNet-1k
    In Collection: DenseCL
    Results: null
    Weights: https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth
    Config: configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
    Downstream:
      - resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k
  - Name: resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k
    Metadata:
      Epochs: 100
      Batch Size: 256
      FLOPs: 4109464576
      Parameters: 25557032
      Training Data: ImageNet-1k
    In Collection: DenseCL
    Results:
      - Task: Image Classification
        Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 63.5
    Weights: https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth
    Config: configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
# DenseNet
> [Densely Connected Convolutional Networks](https://arxiv.org/abs/1608.06993)
<!-- [ALGORITHM] -->
## Abstract
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance.
<div align=center>
<img src="https://user-images.githubusercontent.com/42952108/162675098-9a670883-b13a-4a5a-a9c9-06c39c616a0a.png" width="100%"/>
</div>
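The dense connectivity the abstract describes is straightforward to write down. Below is a minimal sketch of a dense block, assuming plain 3x3 convolutions with a fixed `growth_rate` (the paper's k); the 1x1 bottleneck layers, transition layers, and memory optimizations of the torchvision implementation referenced in the table are omitted, and the class names are illustrative.

```python
import torch
import torch.nn as nn


class DenseLayer(nn.Module):
    """BN-ReLU-Conv producing `growth_rate` new feature maps."""

    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, 3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(torch.relu(self.norm(x)))


class DenseBlock(nn.Module):
    """Layer i receives the concatenation of the block input and the outputs
    of all earlier layers, giving the L(L+1)/2 direct connections."""

    def __init__(self, num_layers, in_channels, growth_rate=32):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)


block = DenseBlock(num_layers=4, in_channels=64)
print(block(torch.rand(1, 64, 32, 32)).shape)  # (1, 64 + 4 * 32, 32, 32)
```

Because each layer only contributes `growth_rate` channels, the network stays narrow even though the connectivity grows quadratically with depth, which is where the parameter savings come from.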
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('densenet121_3rdparty_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('densenet121_3rdparty_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Test:
```shell
python tools/test.py configs/densenet/densenet121_4xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :------------------------------------------------------------------------------------: |
| `densenet121_3rdparty_in1k`\* | From scratch | 7.98 | 2.88 | 74.96 | 92.21 | [config](densenet121_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth) |
| `densenet169_3rdparty_in1k`\* | From scratch | 14.15 | 3.42 | 76.08 | 93.11 | [config](densenet169_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet169_4xb256_in1k_20220426-a2889902.pth) |
| `densenet201_3rdparty_in1k`\* | From scratch | 20.01 | 4.37 | 77.32 | 93.64 | [config](densenet201_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet201_4xb256_in1k_20220426-05cae4ef.pth) |
| `densenet161_3rdparty_in1k`\* | From scratch | 28.68 | 7.82 | 77.61 | 93.83 | [config](densenet161_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet161_4xb256_in1k_20220426-ee6a80a9.pth) |
*Models with * are converted from the [official repo](https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@misc{https://doi.org/10.48550/arxiv.1608.06993,
  doi = {10.48550/ARXIV.1608.06993},
  url = {https://arxiv.org/abs/1608.06993},
  author = {Huang, Gao and Liu, Zhuang and van der Maaten, Laurens and Weinberger, Kilian Q.},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {Densely Connected Convolutional Networks},
  publisher = {arXiv},
  year = {2016},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
# configs/densenet/densenet121_4xb256_in1k.py
_base_ = [
    '../_base_/models/densenet/densenet121.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)
# configs/densenet/densenet161_4xb256_in1k.py
_base_ = [
    '../_base_/models/densenet/densenet161.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)
# configs/densenet/densenet169_4xb256_in1k.py
_base_ = [
    '../_base_/models/densenet/densenet169.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)
# configs/densenet/densenet201_4xb256_in1k.py
_base_ = [
    '../_base_/models/densenet/densenet201.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)
Collections:
  - Name: DenseNet
    Metadata:
      Training Data: ImageNet-1k
      Architecture:
        - DenseBlock
    Paper:
      URL: https://arxiv.org/abs/1608.06993
      Title: Densely Connected Convolutional Networks
    README: configs/densenet/README.md

Models:
  - Name: densenet121_3rdparty_in1k
    Metadata:
      FLOPs: 2881695488
      Parameters: 7978856
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 74.96
          Top 5 Accuracy: 92.21
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth
    Config: configs/densenet/densenet121_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet121-a639ec97.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
  - Name: densenet169_3rdparty_in1k
    Metadata:
      FLOPs: 3416860160
      Parameters: 14149480
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 76.08
          Top 5 Accuracy: 93.11
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet169_4xb256_in1k_20220426-a2889902.pth
    Config: configs/densenet/densenet169_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet169-b2777c0a.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
  - Name: densenet201_3rdparty_in1k
    Metadata:
      FLOPs: 4365236736
      Parameters: 20013928
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 77.32
          Top 5 Accuracy: 93.64
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet201_4xb256_in1k_20220426-05cae4ef.pth
    Config: configs/densenet/densenet201_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet201-c1103571.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
  - Name: densenet161_3rdparty_in1k
    Metadata:
      FLOPs: 7816363968
      Parameters: 28681000
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 77.61
          Top 5 Accuracy: 93.83
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet161_4xb256_in1k_20220426-ee6a80a9.pth
    Config: configs/densenet/densenet161_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet161-8d451a50.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
# DINOv2
> [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)
<!-- [ALGORITHM] -->
## Abstract
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without fine-tuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/234560516-b495795c-c75c-444c-a712-bb61a3de444e.png" width="70%"/>
</div>
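Most of the DINOv2 recipe concerns data curation and training stability; the step that produces the smaller released models is distillation from the frozen ViT-g teacher. The sketch below shows a generic feature-distillation objective of that flavor, purely as an illustration: the actual DINOv2 loss distills through the DINO/iBOT projection heads rather than matching raw backbone features, and `distill_loss` is a hypothetical name.

```python
import torch
import torch.nn.functional as F


def distill_loss(student_feats, teacher_feats):
    """Push student embeddings toward the frozen teacher's embeddings
    of the same image (cosine-similarity form)."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats.detach(), dim=-1)  # no gradient to teacher
    return 1 - (s * t).sum(dim=-1).mean()
```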
## How to use it?
<!-- [TABS-BEGIN] -->
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('vit-small-p14_dinov2-pre_3rdparty', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :------------------------------------ | :--------: | :-------: | :--------------------------------------------: | :------------------------------------------------------------------------------------------------: |
| `vit-small-p14_dinov2-pre_3rdparty`\* | 22.06 | 46.76 | [config](vit-small-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth) |
| `vit-base-p14_dinov2-pre_3rdparty`\* | 86.58 | 152.00 | [config](vit-base-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth) |
| `vit-large-p14_dinov2-pre_3rdparty`\* | 304.00 | 507.00 | [config](vit-large-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth) |
| `vit-giant-p14_dinov2-pre_3rdparty`\* | 1136.00 | 1784.00 | [config](vit-giant-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth) |
*Models with * are converted from the [official repo](https://github.com/facebookresearch/dinov2). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@misc{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
  journal={arXiv:2304.07193},
  year={2023}
}
```
Collections:
  - Name: DINOv2
    Metadata:
      Architecture:
        - Dropout
        - GELU
        - Layer Normalization
        - Multi-Head Attention
        - Scaled Dot-Product Attention
    Paper:
      Title: 'DINOv2: Learning Robust Visual Features without Supervision'
      URL: https://arxiv.org/abs/2304.07193
    README: configs/dinov2/README.md
    Code:
      URL: null
      Version: null

Models:
  - Name: vit-small-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 46762000000
      Parameters: 22056000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth
    Config: configs/dinov2/vit-small-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
  - Name: vit-base-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 152000000000
      Parameters: 86580000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth
    Config: configs/dinov2/vit-base-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
  - Name: vit-large-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 507000000000
      Parameters: 304000000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth
    Config: configs/dinov2/vit-large-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
  - Name: vit-giant-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 1784000000000
      Parameters: 1136000000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth
    Config: configs/dinov2/vit-giant-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
# configs/dinov2/vit-base-p14_dinov2-pre_headless.py
# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='base',
        img_size=518,
        patch_size=14,  # 518 / 14 = 37 patches per side
        layer_scale_init_value=1e-5,
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
# configs/dinov2/vit-giant-p14_dinov2-pre_headless.py
# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='dinov2-giant',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
        layer_cfgs=dict(ffn_type='swiglu_fused'),
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
# configs/dinov2/vit-large-p14_dinov2-pre_headless.py
# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='large',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
# configs/dinov2/vit-small-p14_dinov2-pre_headless.py
# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='dinov2-small',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
# EdgeNeXt
> [EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications](https://arxiv.org/abs/2206.10589)
<!-- [ALGORITHM] -->
## Abstract
In the pursuit of achieving ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient, general-purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture, EdgeNeXt. Specifically, in EdgeNeXt we introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and utilizes depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection and segmentation tasks reveal the merits of the proposed approach, outperforming state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% and a 28% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
<div align=center>
<img src="https://github.com/mmaaz60/EdgeNeXt/raw/main/images/EdgeNext.png" width="100%"/>
</div>
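The SDTA encoder combines the two ingredients the abstract names: a cascade of depth-wise convolutions over channel splits (to grow the receptive field) and self-attention computed across the channel dimension rather than the spatial one (so its cost scales with channels instead of resolution). The sketch below illustrates both ideas under simplifying assumptions; it omits the pointwise convolutions, positional encoding, and learned query/key/value projections of the real EdgeNeXt block, and `SDTASketch` is a hypothetical name.

```python
import torch
import torch.nn as nn


class SDTASketch(nn.Module):
    """Channel-split depth-wise conv cascade + transposed self-attention."""

    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        split = channels // groups
        self.groups = groups
        self.dwconvs = nn.ModuleList(
            nn.Conv2d(split, split, 3, padding=1, groups=split)
            for _ in range(groups))

    def forward(self, x):
        # Cascade over channel splits: each split also sees the previous
        # split's output, implicitly enlarging the receptive field.
        outs, prev = [], 0
        for conv, chunk in zip(self.dwconvs, torch.chunk(x, self.groups, dim=1)):
            prev = conv(chunk + prev)
            outs.append(prev)
        y = torch.cat(outs, dim=1)

        # Transposed attention: the attention map is C x C over channels,
        # not (H*W) x (H*W) over spatial positions.
        n, c, h, w = y.shape
        tokens = y.flatten(2)                                  # (N, C, H*W)
        attn = torch.softmax(
            tokens @ tokens.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
        y = (attn @ tokens).reshape(n, c, h, w)
        return y + x                                           # residual


print(SDTASketch(64)(torch.rand(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)
```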
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('edgenext-xxsmall_3rdparty_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('edgenext-xxsmall_3rdparty_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Test:
```shell
python tools/test.py configs/edgenext/edgenext-xxsmall_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :----------------------------------------------------------------------: |
| `edgenext-xxsmall_3rdparty_in1k`\* | From scratch | 1.33 | 0.26 | 71.20 | 89.91 | [config](edgenext-xxsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth) |
| `edgenext-xsmall_3rdparty_in1k`\* | From scratch | 2.34 | 0.53 | 74.86 | 92.31 | [config](edgenext-xsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xsmall_3rdparty_in1k_20220801-974f9fe7.pth) |
| `edgenext-small_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 79.41 | 94.53 | [config](edgenext-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty_in1k_20220801-d00db5f8.pth) |
| `edgenext-small-usi_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 81.06 | 95.34 | [config](edgenext-small_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty-usi_in1k_20220801-ae6d8dd3.pth) |
| `edgenext-base_3rdparty_in1k`\* | From scratch | 18.51 | 3.81 | 82.48 | 96.20 | [config](edgenext-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty_in1k_20220801-9ade408b.pth) |
| `edgenext-base_3rdparty-usi_in1k`\* | From scratch | 18.51 | 3.81 | 83.67 | 96.70 | [config](edgenext-base_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty-usi_in1k_20220801-909e8939.pth) |
*Models with * are converted from the [official repo](https://github.com/mmaaz60/EdgeNeXt). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@article{Maaz2022EdgeNeXt,
  title={EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications},
  author={Muhammad Maaz and Abdelrahman Shaker and Hisham Cholakkal and Salman Khan and Syed Waqas Zamir and Rao Muhammad Anwer and Fahad Shahbaz Khan},
  journal={arXiv preprint arXiv:2206.10589},
  year={2022}
}
```
# configs/edgenext/edgenext-base_8xb256-usi_in1k.py
_base_ = ['./edgenext-base_8xb256_in1k.py']

# dataset setting
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='ResizeEdge',
        scale=269,
        edge='short',
        backend='pillow',
        interpolation='bicubic'),
    dict(type='CenterCrop', crop_size=256),
    dict(type='PackInputs')
]

val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
# configs/edgenext/edgenext-base_8xb256_in1k.py
_base_ = [
    '../_base_/models/edgenext/edgenext-base.py',
    '../_base_/datasets/imagenet_bs64_edgenext_256.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

# schedule setting
optim_wrapper = dict(
    optimizer=dict(lr=6e-3),
    clip_grad=dict(max_norm=5.0),
)
# runtime setting
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (32 GPUs) x (128 samples per GPU)
auto_scale_lr = dict(base_batch_size=4096)
# configs/edgenext/edgenext-small_8xb256-usi_in1k.py
_base_ = ['./edgenext-small_8xb256_in1k.py']

# dataset setting
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='ResizeEdge',
        scale=269,
        edge='short',
        backend='pillow',
        interpolation='bicubic'),
    dict(type='CenterCrop', crop_size=256),
    dict(type='PackInputs')
]

val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader