_base_ = [
    '../_base_/models/densenet/densenet121.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)
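# Illustration only (not part of the original config): with auto LR scaling
# enabled (typically via an `--auto-scale-lr` flag), MMEngine-style runners
# rescale the optimizer LR linearly with the effective batch size, roughly:
#     actual_batch_size = num_gpus * samples_per_gpu   # e.g. 4 x 256 = 1024
#     scaled_lr = base_lr * actual_batch_size / base_batch_size
# so a run on 8 GPUs x 256 samples (2048 total) would double the base LR.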

_base_ = [
    '../_base_/models/densenet/densenet161.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)

_base_ = [
    '../_base_/models/densenet/densenet169.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)

_base_ = [
    '../_base_/models/densenet/densenet201.py',
    '../_base_/datasets/imagenet_bs64.py',
    '../_base_/schedules/imagenet_bs256.py',
    '../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# schedule settings
train_cfg = dict(by_epoch=True, max_epochs=90)
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (4 GPUs) x (256 samples per GPU)
auto_scale_lr = dict(base_batch_size=1024)

Collections:
  - Name: DenseNet
    Metadata:
      Training Data: ImageNet-1k
      Architecture:
        - DenseBlock
    Paper:
      URL: https://arxiv.org/abs/1608.06993
      Title: Densely Connected Convolutional Networks
    README: configs/densenet/README.md

Models:
  - Name: densenet121_3rdparty_in1k
    Metadata:
      FLOPs: 2881695488
      Parameters: 7978856
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 74.96
          Top 5 Accuracy: 92.21
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth
    Config: configs/densenet/densenet121_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet121-a639ec97.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
  - Name: densenet169_3rdparty_in1k
    Metadata:
      FLOPs: 3416860160
      Parameters: 14149480
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 76.08
          Top 5 Accuracy: 93.11
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet169_4xb256_in1k_20220426-a2889902.pth
    Config: configs/densenet/densenet169_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet169-b2777c0a.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
  - Name: densenet201_3rdparty_in1k
    Metadata:
      FLOPs: 4365236736
      Parameters: 20013928
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 77.32
          Top 5 Accuracy: 93.64
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet201_4xb256_in1k_20220426-05cae4ef.pth
    Config: configs/densenet/densenet201_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet201-c1103571.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
  - Name: densenet161_3rdparty_in1k
    Metadata:
      FLOPs: 7816363968
      Parameters: 28681000
    In Collection: DenseNet
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 77.61
          Top 5 Accuracy: 93.83
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet161_4xb256_in1k_20220426-ee6a80a9.pth
    Config: configs/densenet/densenet161_4xb256_in1k.py
    Converted From:
      Weights: https://download.pytorch.org/models/densenet161-8d451a50.pth
      Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
# DINOv2
> [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)
<!-- [ALGORITHM] -->
## Abstract
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/234560516-b495795c-c75c-444c-a712-bb61a3de444e.png" width="70%"/>
</div>
## How to use it?
<!-- [TABS-BEGIN] -->
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('vit-small-p14_dinov2-pre_3rdparty', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :------------------------------------ | :--------: | :-------: | :--------------------------------------------: | :------------------------------------------------------------------------------------------------: |
| `vit-small-p14_dinov2-pre_3rdparty`\* | 22.06 | 46.76 | [config](vit-small-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth) |
| `vit-base-p14_dinov2-pre_3rdparty`\* | 86.58 | 152.00 | [config](vit-base-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth) |
| `vit-large-p14_dinov2-pre_3rdparty`\* | 304.00 | 507.00 | [config](vit-large-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth) |
| `vit-giant-p14_dinov2-pre_3rdparty`\* | 1136.00 | 1784.00 | [config](vit-giant-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth) |
*Models with * are converted from the [official repo](https://github.com/facebookresearch/dinov2). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@misc{oquab2023dinov2,
title={DINOv2: Learning Robust Visual Features without Supervision},
author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
journal={arXiv:2304.07193},
year={2023}
}
```

Collections:
  - Name: DINOv2
    Metadata:
      Architecture:
        - Dropout
        - GELU
        - Layer Normalization
        - Multi-Head Attention
        - Scaled Dot-Product Attention
    Paper:
      Title: 'DINOv2: Learning Robust Visual Features without Supervision'
      URL: https://arxiv.org/abs/2304.07193
    README: configs/dinov2/README.md
    Code:
      URL: null
      Version: null

Models:
  - Name: vit-small-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 46762000000
      Parameters: 22056000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth
    Config: configs/dinov2/vit-small-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
  - Name: vit-base-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 152000000000
      Parameters: 86580000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth
    Config: configs/dinov2/vit-base-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
  - Name: vit-large-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 507000000000
      Parameters: 304000000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth
    Config: configs/dinov2/vit-large-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2
  - Name: vit-giant-p14_dinov2-pre_3rdparty
    Metadata:
      FLOPs: 1784000000000
      Parameters: 1136000000
      Training Data:
        - LVD-142M
    In Collection: DINOv2
    Results: null
    Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth
    Config: configs/dinov2/vit-giant-p14_dinov2-pre_headless.py
    Converted From:
      Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth
      Code: https://github.com/facebookresearch/dinov2

# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='base',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
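# Rough illustration (not part of the original config) of what the
# preprocessor above does, assuming a BGR uint8 input image:
#     img_rgb = img_bgr[..., ::-1]          # to_rgb=True
#     out = (img_rgb - mean) / std          # per-channel normalization
# e.g. a red value of 124 maps to (124 - 123.675) / 58.395 ≈ 0.0056.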

# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='dinov2-giant',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
        layer_cfgs=dict(ffn_type='swiglu_fused'),
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
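# Note (illustration, not from the original config): `ffn_type='swiglu_fused'`
# selects a SwiGLU feed-forward block instead of the usual GELU MLP used by the
# other variants. In simplified form:
#     ffn(x) = W3 @ (silu(W1 @ x) * (W2 @ x))
# i.e. a SiLU-gated linear unit followed by an output projection.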

# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='large',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)

# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='dinov2-small',
        img_size=518,
        patch_size=14,
        layer_scale_init_value=1e-5,
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)
# EdgeNeXt
> [EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications](https://arxiv.org/abs/2206.10589)
<!-- [ALGORITHM] -->
## Abstract
In the pursuit of achieving ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient general-purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture, EdgeNeXt. Specifically, in EdgeNeXt we introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and utilizes depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection and segmentation tasks reveal the merits of the proposed approach, outperforming state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% and a 28% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
<div align=center>
<img src="https://github.com/mmaaz60/EdgeNeXt/raw/main/images/EdgeNext.png" width="100%"/>
</div>
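To make the SDTA encoder described in the abstract a bit more concrete, below is a rough PyTorch sketch of its two ingredients: a Res2Net-style channel split processed by depth-wise convolutions, followed by attention computed across the channel dimension ("transposed" attention). This is our own simplified illustration with assumed shapes and hyperparameters, not the official block; see the [official repo](https://github.com/mmaaz60/EdgeNeXt) or the mmpretrain backbone implementation for the real code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SDTASketch(nn.Module):
    """Simplified sketch of a split depth-wise transpose attention (SDTA) block."""

    def __init__(self, dim: int = 64, splits: int = 4, heads: int = 4):
        super().__init__()
        group = dim // splits
        # One 3x3 depth-wise conv per channel group (multi-scale receptive field).
        self.dwconvs = nn.ModuleList([
            nn.Conv2d(group, group, 3, padding=1, groups=group)
            for _ in range(splits - 1)
        ])
        self.splits = splits
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        B, C, H, W = x.shape
        chunks = list(x.chunk(self.splits, dim=1))
        outs, prev = [chunks[0]], chunks[0]
        for conv, ci in zip(self.dwconvs, chunks[1:]):
            prev = conv(ci + prev)  # hierarchical (Res2Net-style) fusion of groups
            outs.append(prev)
        tokens = torch.cat(outs, dim=1).flatten(2).transpose(1, 2)  # (B, N, C)

        # "Transpose" attention: the attention map is (C/heads x C/heads), so its
        # cost grows with the number of channels, not with the spatial tokens.
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        q = q.transpose(1, 2).reshape(B, self.heads, C // self.heads, -1)
        k = k.transpose(1, 2).reshape(B, self.heads, C // self.heads, -1)
        v = v.transpose(1, 2).reshape(B, self.heads, C // self.heads, -1)
        attn = (F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)).softmax(-1)
        out = (attn @ v).reshape(B, C, -1).transpose(1, 2)  # back to (B, N, C)
        return self.proj(out).transpose(1, 2).reshape(B, C, H, W)


# quick shape check
print(SDTASketch()(torch.rand(1, 64, 14, 14)).shape)  # torch.Size([1, 64, 14, 14])
```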
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('edgenext-xxsmall_3rdparty_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('edgenext-xxsmall_3rdparty_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Test:
```shell
python tools/test.py configs/edgenext/edgenext-xxsmall_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :----------------------------------------------------------------------: |
| `edgenext-xxsmall_3rdparty_in1k`\* | From scratch | 1.33 | 0.26 | 71.20 | 89.91 | [config](edgenext-xxsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth) |
| `edgenext-xsmall_3rdparty_in1k`\* | From scratch | 2.34 | 0.53 | 74.86 | 92.31 | [config](edgenext-xsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xsmall_3rdparty_in1k_20220801-974f9fe7.pth) |
| `edgenext-small_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 79.41 | 94.53 | [config](edgenext-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty_in1k_20220801-d00db5f8.pth) |
| `edgenext-small-usi_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 81.06 | 95.34 | [config](edgenext-small_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty-usi_in1k_20220801-ae6d8dd3.pth) |
| `edgenext-base_3rdparty_in1k`\* | From scratch | 18.51 | 3.81 | 82.48 | 96.20 | [config](edgenext-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty_in1k_20220801-9ade408b.pth) |
| `edgenext-base_3rdparty-usi_in1k`\* | From scratch | 18.51 | 3.81 | 83.67 | 96.70 | [config](edgenext-base_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty-usi_in1k_20220801-909e8939.pth) |
*Models with * are converted from the [official repo](https://github.com/mmaaz60/EdgeNeXt). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@article{Maaz2022EdgeNeXt,
title={EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications},
author={Muhammad Maaz and Abdelrahman Shaker and Hisham Cholakkal and Salman Khan and Syed Waqas Zamir and Rao Muhammad Anwer and Fahad Shahbaz Khan},
journal={arXiv:2206.10589},
year={2022}
}
```
_base_ = ['./edgenext-base_8xb256_in1k.py']
# dataset setting
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='ResizeEdge',
        scale=269,
        edge='short',
        backend='pillow',
        interpolation='bicubic'),
    dict(type='CenterCrop', crop_size=256),
    dict(type='PackInputs')
]
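# Illustration (not part of the original config): for a 400x500 (HxW) input,
# ResizeEdge(scale=269, edge='short') rescales the short side to 269, giving
# roughly 269x336, and CenterCrop(crop_size=256) then feeds the central
# 256x256 region to the network.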
val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader

_base_ = [
    '../_base_/models/edgenext/edgenext-base.py',
    '../_base_/datasets/imagenet_bs64_edgenext_256.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

# schedule setting
optim_wrapper = dict(
    optimizer=dict(lr=6e-3),
    clip_grad=dict(max_norm=5.0),
)
# runtime setting
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (32 GPUs) x (128 samples per GPU)
auto_scale_lr = dict(base_batch_size=4096)
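# Illustration only (assuming MMEngine's ExponentialMovingAverage semantics):
# with momentum=4e-5 the EMAHook above updates its weights each iteration as
#     ema_param = (1 - momentum) * ema_param + momentum * param
# i.e. a very slowly moving average of the trained weights used at evaluation.
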
_base_ = ['./edgenext-small_8xb256_in1k.py']
# dataset setting
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='ResizeEdge',
        scale=269,
        edge='short',
        backend='pillow',
        interpolation='bicubic'),
    dict(type='CenterCrop', crop_size=256),
    dict(type='PackInputs')
]
val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader

_base_ = [
    '../_base_/models/edgenext/edgenext-small.py',
    '../_base_/datasets/imagenet_bs64_edgenext_256.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

# schedule setting
optim_wrapper = dict(
    optimizer=dict(lr=6e-3),
    clip_grad=dict(max_norm=5.0),
)
# runtime setting
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (32 GPUs) x (128 samples per GPU)
auto_scale_lr = dict(base_batch_size=4096)

_base_ = [
    '../_base_/models/edgenext/edgenext-xsmall.py',
    '../_base_/datasets/imagenet_bs64_edgenext_256.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

# schedule setting
optim_wrapper = dict(
    optimizer=dict(lr=6e-3),
    clip_grad=dict(max_norm=5.0),
)
# runtime setting
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (32 GPUs) x (128 samples per GPU)
auto_scale_lr = dict(base_batch_size=4096)

_base_ = [
    '../_base_/models/edgenext/edgenext-xxsmall.py',
    '../_base_/datasets/imagenet_bs64_edgenext_256.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

# schedule setting
optim_wrapper = dict(
    optimizer=dict(lr=6e-3),
    clip_grad=dict(max_norm=5.0),
)
# runtime setting
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
# base_batch_size = (32 GPUs) x (128 samples per GPU)
auto_scale_lr = dict(base_batch_size=4096)

Collections:
  - Name: EdgeNeXt
    Metadata:
      Training Data: ImageNet-1k
      Architecture:
        - SDTA
        - 1x1 Convolution
        - Channel Self-attention
    Paper:
      URL: https://arxiv.org/abs/2206.10589
      Title: 'EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications'
    README: configs/edgenext/README.md
    Code:
      Version: v1.0.0rc1
      URL: https://github.com/open-mmlab/mmpretrain/blob/v0.23.2/mmcls/models/backbones/edgenext.py

Models:
  - Name: edgenext-xxsmall_3rdparty_in1k
    Metadata:
      FLOPs: 255640144
      Parameters: 1327216
    In Collection: EdgeNeXt
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 71.20
          Top 5 Accuracy: 89.91
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth
    Config: configs/edgenext/edgenext-xxsmall_8xb256_in1k.py
    Converted From:
      Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_xxsmall.pth
      Code: https://github.com/mmaaz60/EdgeNeXt
  - Name: edgenext-xsmall_3rdparty_in1k
    Metadata:
      Training Data: ImageNet-1k
      FLOPs: 529970560
      Parameters: 2336804
    In Collection: EdgeNeXt
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 74.86
          Top 5 Accuracy: 92.31
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xsmall_3rdparty_in1k_20220801-974f9fe7.pth
    Config: configs/edgenext/edgenext-xsmall_8xb256_in1k.py
    Converted From:
      Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_xsmall.pth
      Code: https://github.com/mmaaz60/EdgeNeXt
  - Name: edgenext-small_3rdparty_in1k
    Metadata:
      Training Data: ImageNet-1k
      FLOPs: 1249788000
      Parameters: 5586832
    In Collection: EdgeNeXt
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 79.41
          Top 5 Accuracy: 94.53
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty_in1k_20220801-d00db5f8.pth
    Config: configs/edgenext/edgenext-small_8xb256_in1k.py
    Converted From:
      Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_small.pth
      Code: https://github.com/mmaaz60/EdgeNeXt
  - Name: edgenext-small-usi_3rdparty_in1k
    Metadata:
      Training Data: ImageNet-1k
      FLOPs: 1249788000
      Parameters: 5586832
    In Collection: EdgeNeXt
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 81.06
          Top 5 Accuracy: 95.34
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty-usi_in1k_20220801-ae6d8dd3.pth
    Config: configs/edgenext/edgenext-small_8xb256-usi_in1k.py
    Converted From:
      Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.1/edgenext_small_usi.pth
      Code: https://github.com/mmaaz60/EdgeNeXt
  - Name: edgenext-base_3rdparty_in1k
    Metadata:
      Training Data: ImageNet-1k
      FLOPs: 3814395280
      Parameters: 18511292
    In Collection: EdgeNeXt
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 82.48
          Top 5 Accuracy: 96.2
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty_in1k_20220801-9ade408b.pth
    Config: configs/edgenext/edgenext-base_8xb256_in1k.py
    Converted From:
      Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.2/edgenext_base.pth
      Code: https://github.com/mmaaz60/EdgeNeXt
  - Name: edgenext-base_3rdparty-usi_in1k
    Metadata:
      Training Data: ImageNet-1k
      FLOPs: 3814395280
      Parameters: 18511292
    In Collection: EdgeNeXt
    Results:
      - Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 83.67
          Top 5 Accuracy: 96.7
        Task: Image Classification
    Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty-usi_in1k_20220801-909e8939.pth
    Config: configs/edgenext/edgenext-base_8xb256-usi_in1k.py
    Converted From:
      Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.2/edgenext_base_usi.pth
      Code: https://github.com/mmaaz60/EdgeNeXt
# EfficientFormer
> [EfficientFormer: Vision Transformers at MobileNet Speed](https://arxiv.org/abs/2206.01191)
<!-- [ALGORITHM] -->
## Abstract
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., the attention mechanism, ViT-based models are generally many times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.
<div align=center>
<img src="https://user-images.githubusercontent.com/18586273/180713426-9d3d77e3-3584-42d8-9098-625b4170d796.png" width="100%"/>
</div>
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('efficientformer-l1_3rdparty_8xb128_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('efficientformer-l1_3rdparty_8xb128_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Test:
```shell
python tools/test.py configs/efficientformer/efficientformer-l1_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :---------------------------------------------------------------: |
| `efficientformer-l1_3rdparty_8xb128_in1k`\* | From scratch | 12.28 | 1.30 | 80.46 | 94.99 | [config](efficientformer-l1_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth) |
| `efficientformer-l3_3rdparty_8xb128_in1k`\* | From scratch | 31.41 | 3.74 | 82.45 | 96.18 | [config](efficientformer-l3_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l3_3rdparty_in1k_20220915-466793d6.pth) |
| `efficientformer-l7_3rdparty_8xb128_in1k`\* | From scratch | 82.23 | 10.16 | 83.40 | 96.60 | [config](efficientformer-l7_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l7_3rdparty_in1k_20220915-185e30af.pth) |
*Models with * are converted from the [official repo](https://github.com/snap-research/EfficientFormer). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@misc{https://doi.org/10.48550/arxiv.2206.01191,
doi = {10.48550/ARXIV.2206.01191},
url = {https://arxiv.org/abs/2206.01191},
author = {Li, Yanyu and Yuan, Geng and Wen, Yang and Hu, Eric and Evangelidis, Georgios and Tulyakov, Sergey and Wang, Yanzhi and Ren, Jian},
keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {EfficientFormer: Vision Transformers at MobileNet Speed},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
```