Commit 0fd8347d authored by unknown

Add the mmclassification-0.24.1 code and remove mmclassification-speed-benchmark

parent cc567e9e
Collections:
- Name: MobileNet V2
Metadata:
Training Data: ImageNet-1k
Training Techniques:
- SGD with Momentum
- Weight Decay
Training Resources: 8x V100 GPUs
Epochs: 300
Batch Size: 256
Architecture:
- MobileNet V2
Paper:
URL: https://arxiv.org/abs/1801.04381
Title: "MobileNetV2: Inverted Residuals and Linear Bottlenecks"
README: configs/mobilenet_v2/README.md
Code:
URL: https://github.com/open-mmlab/mmclassification/blob/v0.15.0/mmcls/models/backbones/mobilenet_v2.py#L101
Version: v0.15.0
Models:
- Name: mobilenet-v2_8xb32_in1k
Metadata:
FLOPs: 319000000
Parameters: 3500000
In Collection: MobileNet V2
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 71.86
Top 5 Accuracy: 90.42
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth
Config: configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py
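The metafile above follows the OpenMMLab model-index schema, so it can be read with any YAML parser to look up checkpoints and reported metrics. A minimal sketch, assuming the file sits at `configs/mobilenet_v2/metafile.yml` as in this repository layout:

```python
# Sketch: list models and their reported Top-1 accuracy from a metafile.
# The metafile path is an assumption for illustration.
import yaml

with open('configs/mobilenet_v2/metafile.yml') as f:
    metafile = yaml.safe_load(f)

for model in metafile['Models']:
    top1 = model['Results'][0]['Metrics']['Top 1 Accuracy']
    print(f"{model['Name']}: Top-1 {top1}%, weights at {model['Weights']}")
```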
_base_ = [
'../_base_/models/mobilenet_v2_1x.py',
'../_base_/datasets/imagenet_bs32_pil_resize.py',
'../_base_/schedules/imagenet_bs256_epochstep.py',
'../_base_/default_runtime.py'
]
#fp16 = dict(loss_scale=512.)
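The commented-out `fp16` line is the MMClassification 0.x switch for mixed-precision training with a fixed loss scale. A minimal sketch of a derived config that turns it on (the file name is chosen for illustration and is not part of this commit):

```python
# mobilenet-v2_8xb32-fp16_in1k.py (illustrative name): inherit the FP32 config
# and enable mixed-precision training with a static loss scale of 512.
_base_ = 'mobilenet-v2_8xb32_in1k.py'
fp16 = dict(loss_scale=512.)
```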
_base_ = 'mobilenet-v2_8xb32_in1k.py'
_deprecation_ = dict(
expected='mobilenet-v2_8xb32_in1k.py',
reference='https://github.com/open-mmlab/mmclassification/pull/508',
)
# MobileNet V3
> [Searching for MobileNetV3](https://arxiv.org/abs/1905.02244)
<!-- [ALGORITHM] -->
## Abstract
We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 15% compared to MobileNetV2. MobileNetV3-Small is 4.6% more accurate while reducing latency by 5% compared to MobileNetV2. MobileNetV3-Large detection is 25% faster at roughly the same accuracy as MobileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 30% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142563801-ef4feacc-ecd7-4d14-a411-8c9d63571749.png" width="70%"/>
</div>
## Results and models
### ImageNet-1k
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :-----------------: | :-------: | :------: | :-------: | :-------: | :----------------------------------------------------------------------: | :------------------------------------------------------------------------: |
| MobileNetV3-Small\* | 2.54 | 0.06 | 67.66 | 87.41 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/mobilenet_v3/mobilenet-v3-small_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_small-8427ecf0.pth) |
| MobileNetV3-Large\* | 5.48 | 0.23 | 74.04 | 91.34 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/mobilenet_v3/mobilenet-v3-large_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth) |
*Models with * are converted from [torchvision](https://pytorch.org/vision/stable/_modules/torchvision/models/mobilenetv3.html). The config files of these models are only for validation; we do not guarantee their training accuracy and welcome you to contribute your reproduction results.*
## Citation
```
@inproceedings{Howard_2019_ICCV,
author = {Howard, Andrew and Sandler, Mark and Chu, Grace and Chen, Liang-Chieh and Chen, Bo and Tan, Mingxing and Wang, Weijun and Zhu, Yukun and Pang, Ruoming and Vasudevan, Vijay and Le, Quoc V. and Adam, Hartwig},
title = {Searching for MobileNetV3},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}
}
```
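To sanity-check one of the converted checkpoints in the table above, MMClassification 0.x exposes a small single-image inference API. A hedged sketch, assuming the config and checkpoint from the table and a local demo image:

```python
# Sketch: single-image inference with the converted MobileNetV3-Small checkpoint.
# The demo image path is an assumption for illustration.
from mmcls.apis import init_model, inference_model

config = 'configs/mobilenet_v3/mobilenet-v3-small_8xb32_in1k.py'
checkpoint = 'https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_small-8427ecf0.pth'

model = init_model(config, checkpoint, device='cpu')
result = inference_model(model, 'demo/demo.JPEG')
print(result)  # dict with pred_label, pred_score, pred_class
```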
Collections:
- Name: MobileNet V3
Metadata:
Training Data: ImageNet-1k
Training Techniques:
- RMSprop with Momentum
- Weight Decay
Training Resources: 8x V100 GPUs
Epochs: 600
Batch Size: 1024
Architecture:
- MobileNet V3
Paper:
URL: https://arxiv.org/abs/1905.02244
Title: Searching for MobileNetV3
README: configs/mobilenet_v3/README.md
Code:
URL: https://github.com/open-mmlab/mmclassification/blob/v0.15.0/mmcls/models/backbones/mobilenet_v3.py
Version: v0.15.0
Models:
- Name: mobilenet_v3_small_imagenet
Metadata:
FLOPs: 60000000
Parameters: 2540000
In Collection: MobileNet V3
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 67.66
Top 5 Accuracy: 87.41
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_small-8427ecf0.pth
Config: configs/mobilenet_v3/mobilenet-v3-small_8xb32_in1k.py
- Name: mobilenet_v3_large_imagenet
Metadata:
FLOPs: 230000000
Parameters: 5480000
In Collection: MobileNet V3
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 74.04
Top 5 Accuracy: 91.34
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth
Config: configs/mobilenet_v3/mobilenet-v3-large_8xb32_in1k.py
# Refer to https://pytorch.org/blog/ml-models-torchvision-v0.9/#classification
# ----------------------------
# -[x] auto_augment='imagenet'
# -[x] batch_size=128 (per gpu)
# -[x] epochs=600
# -[x] opt='rmsprop'
# -[x] lr=0.064
# -[x] eps=0.0316
# -[x] alpha=0.9
# -[x] weight_decay=1e-05
# -[x] momentum=0.9
# -[x] lr_gamma=0.973
# -[x] lr_step_size=2
# -[x] nproc_per_node=8
# -[x] random_erase=0.2
# -[x] workers=16 (workers_per_gpu)
# - modification: RandomErasing uses RE-M instead of RE-0
_base_ = [
'../_base_/models/mobilenet_v3_large_imagenet.py',
'../_base_/datasets/imagenet_bs32_pil_resize.py',
'../_base_/default_runtime.py'
]
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
policies = [
[
dict(type='Posterize', bits=4, prob=0.4),
dict(type='Rotate', angle=30., prob=0.6)
],
[
dict(type='Solarize', thr=256 / 9 * 4, prob=0.6),
dict(type='AutoContrast', prob=0.6)
],
[dict(type='Equalize', prob=0.8),
dict(type='Equalize', prob=0.6)],
[
dict(type='Posterize', bits=5, prob=0.6),
dict(type='Posterize', bits=5, prob=0.6)
],
[
dict(type='Equalize', prob=0.4),
dict(type='Solarize', thr=256 / 9 * 5, prob=0.2)
],
[
dict(type='Equalize', prob=0.4),
dict(type='Rotate', angle=30 / 9 * 8, prob=0.8)
],
[
dict(type='Solarize', thr=256 / 9 * 6, prob=0.6),
dict(type='Equalize', prob=0.6)
],
[dict(type='Posterize', bits=6, prob=0.8),
dict(type='Equalize', prob=1.)],
[
dict(type='Rotate', angle=10., prob=0.2),
dict(type='Solarize', thr=256 / 9, prob=0.6)
],
[
dict(type='Equalize', prob=0.6),
dict(type='Posterize', bits=5, prob=0.4)
],
[
dict(type='Rotate', angle=30 / 9 * 8, prob=0.8),
dict(type='ColorTransform', magnitude=0., prob=0.4)
],
[
dict(type='Rotate', angle=30., prob=0.4),
dict(type='Equalize', prob=0.6)
],
[dict(type='Equalize', prob=0.0),
dict(type='Equalize', prob=0.8)],
[dict(type='Invert', prob=0.6),
dict(type='Equalize', prob=1.)],
[
dict(type='ColorTransform', magnitude=0.4, prob=0.6),
dict(type='Contrast', magnitude=0.8, prob=1.)
],
[
dict(type='Rotate', angle=30 / 9 * 8, prob=0.8),
dict(type='ColorTransform', magnitude=0.2, prob=1.)
],
[
dict(type='ColorTransform', magnitude=0.8, prob=0.8),
dict(type='Solarize', thr=256 / 9 * 2, prob=0.8)
],
[
dict(type='Sharpness', magnitude=0.7, prob=0.4),
dict(type='Invert', prob=0.6)
],
[
dict(
type='Shear',
magnitude=0.3 / 9 * 5,
prob=0.6,
direction='horizontal'),
dict(type='Equalize', prob=1.)
],
[
dict(type='ColorTransform', magnitude=0., prob=0.4),
dict(type='Equalize', prob=0.6)
],
[
dict(type='Equalize', prob=0.4),
dict(type='Solarize', thr=256 / 9 * 5, prob=0.2)
],
[
dict(type='Solarize', thr=256 / 9 * 4, prob=0.6),
dict(type='AutoContrast', prob=0.6)
],
[dict(type='Invert', prob=0.6),
dict(type='Equalize', prob=1.)],
[
dict(type='ColorTransform', magnitude=0.4, prob=0.6),
dict(type='Contrast', magnitude=0.8, prob=1.)
],
[dict(type='Equalize', prob=0.8),
dict(type='Equalize', prob=0.6)],
]
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', size=224, backend='pillow'),
dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
dict(type='AutoAugment', policies=policies),
dict(
type='RandomErasing',
erase_prob=0.2,
mode='const',
min_area_ratio=0.02,
max_area_ratio=1 / 3,
fill_color=img_norm_cfg['mean']),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='ToTensor', keys=['gt_label']),
dict(type='Collect', keys=['img', 'gt_label'])
]
data = dict(
samples_per_gpu=128,
workers_per_gpu=4,
train=dict(pipeline=train_pipeline))
evaluation = dict(interval=10, metric='accuracy')
# optimizer
optimizer = dict(
type='RMSprop',
lr=0.064,
alpha=0.9,
momentum=0.9,
eps=0.0316,
weight_decay=1e-5)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(policy='step', step=2, gamma=0.973, by_epoch=True)
runner = dict(type='EpochBasedRunner', max_epochs=600)
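The `lr_config` above decays the learning rate by `gamma=0.973` every 2 epochs with no warmup, so the rate at epoch `e` is `0.064 * 0.973 ** (e // 2)`. A quick sketch of the resulting curve over the 600-epoch run:

```python
# Sketch: effective learning rate under lr_config(policy='step', step=2, gamma=0.973)
# starting from lr=0.064 (no warmup is configured here).
base_lr, gamma, step = 0.064, 0.973, 2

def lr_at_epoch(epoch):
    return base_lr * gamma ** (epoch // step)

for e in (0, 100, 300, 599):
    print(f"epoch {e:3d}: lr = {lr_at_epoch(e):.6f}")
# The final epochs sit around 0.064 * 0.973**299, roughly 1.8e-5.
```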
_base_ = [
'../_base_/models/mobilenet-v3-small_cifar.py',
'../_base_/datasets/cifar10_bs16.py',
'../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
]
lr_config = dict(policy='step', step=[120, 170])
runner = dict(type='EpochBasedRunner', max_epochs=200)
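Unlike the ImageNet schedules, the CIFAR-10 config uses a multi-step policy: the learning rate drops at epochs 120 and 170 over a 200-epoch run, by the step policy's default factor of 0.1. A small sketch (the base rate of 0.1 is illustrative; the real value comes from the inherited schedule file):

```python
# Sketch: multi-step decay with milestones [120, 170]; the step policy defaults
# to gamma=0.1 when not given, so the lr drops 10x at each milestone.
def multistep_lr(base_lr, epoch, milestones=(120, 170), gamma=0.1):
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops

print(multistep_lr(0.1, 0))    # 0.1
print(multistep_lr(0.1, 150))  # 0.01
print(multistep_lr(0.1, 199))  # 0.001
```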
# Refer to https://pytorch.org/blog/ml-models-torchvision-v0.9/#classification
# ----------------------------
# -[x] auto_augment='imagenet'
# -[x] batch_size=128 (per gpu)
# -[x] epochs=600
# -[x] opt='rmsprop'
# -[x] lr=0.064
# -[x] eps=0.0316
# -[x] alpha=0.9
# -[x] weight_decay=1e-05
# -[x] momentum=0.9
# -[x] lr_gamma=0.973
# -[x] lr_step_size=2
# -[x] nproc_per_node=8
# -[x] random_erase=0.2
# -[x] workers=16 (workers_per_gpu)
# - modification: RandomErasing uses RE-M instead of RE-0
_base_ = [
'../_base_/models/mobilenet_v3_small_imagenet.py',
'../_base_/datasets/imagenet_bs32_pil_resize.py',
'../_base_/default_runtime.py'
]
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
policies = [
[
dict(type='Posterize', bits=4, prob=0.4),
dict(type='Rotate', angle=30., prob=0.6)
],
[
dict(type='Solarize', thr=256 / 9 * 4, prob=0.6),
dict(type='AutoContrast', prob=0.6)
],
[dict(type='Equalize', prob=0.8),
dict(type='Equalize', prob=0.6)],
[
dict(type='Posterize', bits=5, prob=0.6),
dict(type='Posterize', bits=5, prob=0.6)
],
[
dict(type='Equalize', prob=0.4),
dict(type='Solarize', thr=256 / 9 * 5, prob=0.2)
],
[
dict(type='Equalize', prob=0.4),
dict(type='Rotate', angle=30 / 9 * 8, prob=0.8)
],
[
dict(type='Solarize', thr=256 / 9 * 6, prob=0.6),
dict(type='Equalize', prob=0.6)
],
[dict(type='Posterize', bits=6, prob=0.8),
dict(type='Equalize', prob=1.)],
[
dict(type='Rotate', angle=10., prob=0.2),
dict(type='Solarize', thr=256 / 9, prob=0.6)
],
[
dict(type='Equalize', prob=0.6),
dict(type='Posterize', bits=5, prob=0.4)
],
[
dict(type='Rotate', angle=30 / 9 * 8, prob=0.8),
dict(type='ColorTransform', magnitude=0., prob=0.4)
],
[
dict(type='Rotate', angle=30., prob=0.4),
dict(type='Equalize', prob=0.6)
],
[dict(type='Equalize', prob=0.0),
dict(type='Equalize', prob=0.8)],
[dict(type='Invert', prob=0.6),
dict(type='Equalize', prob=1.)],
[
dict(type='ColorTransform', magnitude=0.4, prob=0.6),
dict(type='Contrast', magnitude=0.8, prob=1.)
],
[
dict(type='Rotate', angle=30 / 9 * 8, prob=0.8),
dict(type='ColorTransform', magnitude=0.2, prob=1.)
],
[
dict(type='ColorTransform', magnitude=0.8, prob=0.8),
dict(type='Solarize', thr=256 / 9 * 2, prob=0.8)
],
[
dict(type='Sharpness', magnitude=0.7, prob=0.4),
dict(type='Invert', prob=0.6)
],
[
dict(
type='Shear',
magnitude=0.3 / 9 * 5,
prob=0.6,
direction='horizontal'),
dict(type='Equalize', prob=1.)
],
[
dict(type='ColorTransform', magnitude=0., prob=0.4),
dict(type='Equalize', prob=0.6)
],
[
dict(type='Equalize', prob=0.4),
dict(type='Solarize', thr=256 / 9 * 5, prob=0.2)
],
[
dict(type='Solarize', thr=256 / 9 * 4, prob=0.6),
dict(type='AutoContrast', prob=0.6)
],
[dict(type='Invert', prob=0.6),
dict(type='Equalize', prob=1.)],
[
dict(type='ColorTransform', magnitude=0.4, prob=0.6),
dict(type='Contrast', magnitude=0.8, prob=1.)
],
[dict(type='Equalize', prob=0.8),
dict(type='Equalize', prob=0.6)],
]
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', size=224, backend='pillow'),
dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
dict(type='AutoAugment', policies=policies),
dict(
type='RandomErasing',
erase_prob=0.2,
mode='const',
min_area_ratio=0.02,
max_area_ratio=1 / 3,
fill_color=img_norm_cfg['mean']),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='ToTensor', keys=['gt_label']),
dict(type='Collect', keys=['img', 'gt_label'])
]
data = dict(
samples_per_gpu=128,
workers_per_gpu=4,
train=dict(pipeline=train_pipeline))
evaluation = dict(interval=10, metric='accuracy')
# optimizer
optimizer = dict(
type='RMSprop',
lr=0.064,
alpha=0.9,
momentum=0.9,
eps=0.0316,
weight_decay=1e-5)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(policy='step', step=2, gamma=0.973, by_epoch=True)
runner = dict(type='EpochBasedRunner', max_epochs=600)
_base_ = 'mobilenet-v3-large_8xb32_in1k.py'
_deprecation_ = dict(
expected='mobilenet-v3-large_8xb32_in1k.py',
reference='https://github.com/open-mmlab/mmclassification/pull/508',
)
_base_ = 'mobilenet-v3-small_8xb16_cifar10.py'
_deprecation_ = dict(
expected='mobilenet-v3-small_8xb16_cifar10.py',
reference='https://github.com/open-mmlab/mmclassification/pull/508',
)
_base_ = 'mobilenet-v3-small_8xb32_in1k.py'
_deprecation_ = dict(
expected='mobilenet-v3-small_8xb32_in1k.py',
reference='https://github.com/open-mmlab/mmclassification/pull/508',
)
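The deprecated configs above simply re-point `_base_` at the renamed files, while the `_deprecation_` dict records the expected new name and the PR that introduced the rename so the loader can warn users. A minimal sketch of how such a field could be honored when a config is loaded (this illustrates the intent of the field, not MMClassification's actual implementation):

```python
# Sketch: emit a DeprecationWarning when a loaded config carries a _deprecation_ field.
import warnings

def warn_if_deprecated(cfg_dict, filename):
    dep = cfg_dict.get('_deprecation_')
    if dep:
        warnings.warn(
            f'Config {filename} is deprecated, please use {dep["expected"]} instead. '
            f'See {dep.get("reference", "the project changelog")} for details.',
            DeprecationWarning)
```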
# MViT V2
> [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf)
<!-- [ALGORITHM] -->
## Abstract
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video
classification, as well as object detection. We present an improved version of MViT that incorporates
decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture
in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where
it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where
it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art
performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as
well as 86.1% on Kinetics-400 video classification.
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/180376227-755243fa-158e-4068-940a-416036519665.png" width="50%"/>
</div>
## Results and models
### ImageNet-1k
| Model | Pretrain | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :------------: | :----------: | :-------: | :------: | :-------: | :-------: | :------------------------------------------------------------------: | :---------------------------------------------------------------------: |
| MViTv2-tiny\* | From scratch | 24.17 | 4.70 | 82.33 | 96.15 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/mvit/mvitv2-tiny_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-tiny_3rdparty_in1k_20220722-db7beeef.pth) |
| MViTv2-small\* | From scratch | 34.87 | 7.00 | 83.63 | 96.51 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/mvit/mvitv2-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-small_3rdparty_in1k_20220722-986bd741.pth) |
| MViTv2-base\* | From scratch | 51.47 | 10.20 | 84.34 | 96.86 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/mvit/mvitv2-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-base_3rdparty_in1k_20220722-9c4f0a17.pth) |
| MViTv2-large\* | From scratch | 217.99 | 42.10 | 85.25 | 97.14 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/mvit/mvitv2-large_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-large_3rdparty_in1k_20220722-2b57b983.pth) |
*Models with * are converted from the [official repo](https://github.com/facebookresearch/mvit). The config files of these models are only for inference; we do not guarantee their training accuracy and welcome you to contribute your reproduction results.*
## Citation
```bibtex
@inproceedings{li2021improved,
title={MViTv2: Improved multiscale vision transformers for classification and detection},
author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
booktitle={CVPR},
year={2022}
}
```
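The Top-1 and Top-5 numbers reported throughout these tables are standard top-k accuracies over the ImageNet-1k validation set. A small generic sketch of the computation, not tied to MMClassification's evaluation code:

```python
# Sketch: top-k accuracy from raw logits, as a reference for the Top-1 / Top-5
# columns above. Shapes: logits (N, num_classes), labels (N,).
import torch

def topk_accuracy(logits, labels, k=5):
    topk = logits.topk(k, dim=1).indices            # (N, k) predicted class ids
    correct = (topk == labels.unsqueeze(1)).any(1)  # hit if the label is among the top k
    return correct.float().mean().item()

logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```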
Collections:
- Name: MViT V2
Metadata:
Architecture:
- Attention Dropout
- Convolution
- Dense Connections
- GELU
- Layer Normalization
- Scaled Dot-Product Attention
- Attention Pooling
Paper:
URL: http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf
Title: 'MViTv2: Improved Multiscale Vision Transformers for Classification and Detection'
README: configs/mvit/README.md
Code:
URL: https://github.com/open-mmlab/mmclassification/blob/v0.24.0/mmcls/models/backbones/mvit.py
Version: v0.24.0
Models:
- Name: mvitv2-tiny_3rdparty_in1k
In Collection: MViT V2
Metadata:
FLOPs: 4700000000
Parameters: 24173320
Training Data:
- ImageNet-1k
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 82.33
Top 5 Accuracy: 96.15
Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-tiny_3rdparty_in1k_20220722-db7beeef.pth
Converted From:
Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_T_in1k.pyth
Code: https://github.com/facebookresearch/mvit
Config: configs/mvit/mvitv2-tiny_8xb256_in1k.py
- Name: mvitv2-small_3rdparty_in1k
In Collection: MViT V2
Metadata:
FLOPs: 7000000000
Parameters: 34870216
Training Data:
- ImageNet-1k
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 83.63
Top 5 Accuracy: 96.51
Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-small_3rdparty_in1k_20220722-986bd741.pth
Converted From:
Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_S_in1k.pyth
Code: https://github.com/facebookresearch/mvit
Config: configs/mvit/mvitv2-small_8xb256_in1k.py
- Name: mvitv2-base_3rdparty_in1k
In Collection: MViT V2
Metadata:
FLOPs: 10200000000
Parameters: 51472744
Training Data:
- ImageNet-1k
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 84.34
Top 5 Accuracy: 96.86
Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-base_3rdparty_in1k_20220722-9c4f0a17.pth
Converted From:
Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_B_in1k.pyth
Code: https://github.com/facebookresearch/mvit
Config: configs/mvit/mvitv2-base_8xb256_in1k.py
- Name: mvitv2-large_3rdparty_in1k
In Collection: MViT V2
Metadata:
FLOPs: 42100000000
Parameters: 217992952
Training Data:
- ImageNet-1k
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 85.25
Top 5 Accuracy: 97.14
Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-large_3rdparty_in1k_20220722-2b57b983.pth
Converted From:
Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_L_in1k.pyth
Code: https://github.com/facebookresearch/mvit
Config: configs/mvit/mvitv2-large_8xb256_in1k.py
_base_ = [
'../_base_/models/mvit/mvitv2-base.py',
'../_base_/datasets/imagenet_bs64_swin_224.py',
'../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../_base_/default_runtime.py'
]
# dataset settings
data = dict(samples_per_gpu=256)
# schedule settings
paramwise_cfg = dict(
norm_decay_mult=0.0,
bias_decay_mult=0.0,
custom_keys={
'.pos_embed': dict(decay_mult=0.0),
'.rel_pos_h': dict(decay_mult=0.0),
'.rel_pos_w': dict(decay_mult=0.0)
})
optimizer = dict(lr=0.00025, paramwise_cfg=paramwise_cfg)
optimizer_config = dict(grad_clip=dict(max_norm=1.0))
# learning policy
lr_config = dict(
policy='CosineAnnealing',
warmup='linear',
warmup_iters=70,
warmup_by_epoch=True)
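The `paramwise_cfg` above removes weight decay from normalization weights, biases, and the relative position tables, the usual recipe for AdamW-trained transformers. A plain-PyTorch sketch of the same grouping idea (the 0.05 decay value is illustrative; this is not the mmcv implementation):

```python
# Sketch: build two AdamW parameter groups so that norm weights, biases and
# relative position tables get zero weight decay, mirroring paramwise_cfg above.
import torch

def split_decay_groups(model, weight_decay=0.05,
                       skip_keys=('pos_embed', 'rel_pos_h', 'rel_pos_w')):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.ndim == 1 or name.endswith('.bias') or any(k in name for k in skip_keys):
            no_decay.append(param)   # norms, biases, position tables: no decay
        else:
            decay.append(param)
    return [
        dict(params=decay, weight_decay=weight_decay),
        dict(params=no_decay, weight_decay=0.0),
    ]

# optimizer = torch.optim.AdamW(split_decay_groups(model), lr=2.5e-4)
```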
_base_ = [
'../_base_/models/mvit/mvitv2-large.py',
'../_base_/datasets/imagenet_bs64_swin_224.py',
'../_base_/schedules/imagenet_bs2048_AdamW.py',
'../_base_/default_runtime.py'
]
# dataset settings
data = dict(samples_per_gpu=256)
# schedule settings
paramwise_cfg = dict(
norm_decay_mult=0.0,
bias_decay_mult=0.0,
custom_keys={
'.pos_embed': dict(decay_mult=0.0),
'.rel_pos_h': dict(decay_mult=0.0),
'.rel_pos_w': dict(decay_mult=0.0)
})
optimizer = dict(lr=0.00025, paramwise_cfg=paramwise_cfg)
optimizer_config = dict(grad_clip=dict(max_norm=1.0))
# learning policy
lr_config = dict(
policy='CosineAnnealing',
warmup='linear',
warmup_iters=70,
warmup_by_epoch=True)
_base_ = [
'../_base_/models/mvit/mvitv2-small.py',
'../_base_/datasets/imagenet_bs64_swin_224.py',
'../_base_/schedules/imagenet_bs2048_AdamW.py',
'../_base_/default_runtime.py'
]
# dataset settings
data = dict(samples_per_gpu=256)
# schedule settings
paramwise_cfg = dict(
norm_decay_mult=0.0,
bias_decay_mult=0.0,
custom_keys={
'.pos_embed': dict(decay_mult=0.0),
'.rel_pos_h': dict(decay_mult=0.0),
'.rel_pos_w': dict(decay_mult=0.0)
})
optimizer = dict(lr=0.00025, paramwise_cfg=paramwise_cfg)
optimizer_config = dict(grad_clip=dict(max_norm=1.0))
# learning policy
lr_config = dict(
policy='CosineAnnealing',
warmup='linear',
warmup_iters=70,
warmup_by_epoch=True)
_base_ = [
'../_base_/models/mvit/mvitv2-tiny.py',
'../_base_/datasets/imagenet_bs64_swin_224.py',
'../_base_/schedules/imagenet_bs2048_AdamW.py',
'../_base_/default_runtime.py'
]
# dataset settings
data = dict(samples_per_gpu=256)
# schedule settings
paramwise_cfg = dict(
norm_decay_mult=0.0,
bias_decay_mult=0.0,
custom_keys={
'.pos_embed': dict(decay_mult=0.0),
'.rel_pos_h': dict(decay_mult=0.0),
'.rel_pos_w': dict(decay_mult=0.0)
})
optimizer = dict(lr=0.00025, paramwise_cfg=paramwise_cfg)
optimizer_config = dict(grad_clip=dict(max_norm=1.0))
# learning policy
lr_config = dict(
policy='CosineAnnealing',
warmup='linear',
warmup_iters=70,
warmup_by_epoch=True)
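All four MViTv2 configs share the same schedule shape: 70 epochs of linear warmup followed by cosine annealing. A sketch of the resulting learning-rate curve; the 300-epoch total and zero floor are assumptions for illustration, since the actual values come from the inherited schedule file, and the warmup formula here is a simplification rather than mmcv's exact one:

```python
# Sketch: linear warmup for 70 epochs, then cosine annealing toward zero.
import math

def lr_at_epoch(epoch, base_lr=2.5e-4, warmup_epochs=70, max_epochs=300, min_lr=0.0):
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (max_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

for e in (0, 69, 150, 299):
    print(e, f'{lr_at_epoch(e):.2e}')
```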
# PoolFormer
> [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418)
<!-- [ALGORITHM] -->
## Abstract
Transformers have shown great potential in computer vision tasks. A common belief is their attention-based token mixer module contributes most to their competence. However, recent works show the attention-based module in transformers can be replaced by spatial MLPs and the resulted models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model, termed as PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing well-tuned vision transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 49%/61% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and urges us to initiate the concept of "MetaFormer", a general architecture abstracted from transformers without specifying the token mixer. Based on the extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design.
<div align=center>
<img src="https://user-images.githubusercontent.com/15921929/144710761-1635f59a-abde-4946-984c-a2c3f22a19d2.png" width="100%"/>
</div>
## Results and models
### ImageNet-1k
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :--------------: | :-------: | :------: | :-------: | :-------: | :-----------------------------------------------------------------------: | :--------------------------------------------------------------------------: |
| PoolFormer-S12\* | 11.92 | 1.87 | 77.24 | 93.51 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/poolformer/poolformer-s12_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s12_3rdparty_32xb128_in1k_20220414-f8d83051.pth) |
| PoolFormer-S24\* | 21.39 | 3.51 | 80.33 | 95.05 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/poolformer/poolformer-s24_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s24_3rdparty_32xb128_in1k_20220414-d7055904.pth) |
| PoolFormer-S36\* | 30.86 | 5.15 | 81.43 | 95.45 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/poolformer/poolformer-s36_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s36_3rdparty_32xb128_in1k_20220414-d78ff3e8.pth) |
| PoolFormer-M36\* | 56.17 | 8.96 | 82.14 | 95.71 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/poolformer/poolformer-m36_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m36_3rdparty_32xb128_in1k_20220414-c55e0949.pth) |
| PoolFormer-M48\* | 73.47 | 11.80 | 82.51 | 95.95 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/poolformer/poolformer-m48_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m48_3rdparty_32xb128_in1k_20220414-9378f3eb.pth) |
*Models with * are converted from the [official repo](https://github.com/sail-sg/poolformer). The config files of these models are only for inference; we do not guarantee their training accuracy and welcome you to contribute your reproduction results.*
## Citation
```bibtex
@article{yu2021metaformer,
title={MetaFormer is Actually What You Need for Vision},
author={Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng},
journal={arXiv preprint arXiv:2111.11418},
year={2021}
}
```
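The abstract's "embarrassingly simple spatial pooling operator" corresponds to a token mixer that average-pools each feature map and subtracts the input, so each token is mixed with its spatial neighbourhood at no learned cost. A minimal sketch of that idea; the 3x3 window and the identity subtraction follow the paper's description, and this is not the repository's implementation:

```python
# Sketch: PoolFormer-style pooling token mixer (average pooling minus identity).
import torch
import torch.nn as nn

class PoolingTokenMixer(nn.Module):
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):          # x: (B, C, H, W)
        return self.pool(x) - x    # subtract the identity, as described in the paper

x = torch.randn(1, 64, 56, 56)
print(PoolingTokenMixer()(x).shape)  # torch.Size([1, 64, 56, 56])
```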
Collections:
- Name: PoolFormer
Metadata:
Training Data: ImageNet-1k
Architecture:
- Pooling
- 1x1 Convolution
- LayerScale
Paper:
URL: https://arxiv.org/abs/2111.11418
Title: MetaFormer is Actually What You Need for Vision
README: configs/poolformer/README.md
Code:
Version: v0.22.1
URL: https://github.com/open-mmlab/mmclassification/blob/v0.22.1/mmcls/models/backbones/poolformer.py
Models:
- Name: poolformer-s12_3rdparty_32xb128_in1k
Metadata:
FLOPs: 1871399424
Parameters: 11915176
In Collection: PoolFormer
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 77.24
Top 5 Accuracy: 93.51
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s12_3rdparty_32xb128_in1k_20220414-f8d83051.pth
Config: configs/poolformer/poolformer-s12_32xb128_in1k.py
Converted From:
Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s12.pth.tar
Code: https://github.com/sail-sg/poolformer
- Name: poolformer-s24_3rdparty_32xb128_in1k
Metadata:
Training Data: ImageNet-1k
FLOPs: 3510411008
Parameters: 21388968
In Collection: PoolFormer
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 80.33
Top 5 Accuracy: 95.05
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s24_3rdparty_32xb128_in1k_20220414-d7055904.pth
Config: configs/poolformer/poolformer-s24_32xb128_in1k.py
Converted From:
Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s24.pth.tar
Code: https://github.com/sail-sg/poolformer
- Name: poolformer-s36_3rdparty_32xb128_in1k
Metadata:
FLOPs: 5149422592
Parameters: 30862760
In Collection: PoolFormer
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 81.43
Top 5 Accuracy: 95.45
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s36_3rdparty_32xb128_in1k_20220414-d78ff3e8.pth
Config: configs/poolformer/poolformer-s36_32xb128_in1k.py
Converted From:
Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s36.pth.tar
Code: https://github.com/sail-sg/poolformer
- Name: poolformer-m36_3rdparty_32xb128_in1k
Metadata:
Training Data: ImageNet-1k
FLOPs: 8960175744
Parameters: 56172520
In Collection: PoolFormer
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 82.14
Top 5 Accuracy: 95.71
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m36_3rdparty_32xb128_in1k_20220414-c55e0949.pth
Config: configs/poolformer/poolformer-m36_32xb128_in1k.py
Converted From:
Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_m36.pth.tar
Code: https://github.com/sail-sg/poolformer
- Name: poolformer-m48_3rdparty_32xb128_in1k
Metadata:
FLOPs: 11801805696
Parameters: 73473448
In Collection: PoolFormer
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 82.51
Top 5 Accuracy: 95.95
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m48_3rdparty_32xb128_in1k_20220414-9378f3eb.pth
Config: configs/poolformer/poolformer-m48_32xb128_in1k.py
Converted From:
Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_m48.pth.tar
Code: https://github.com/sail-sg/poolformer
_base_ = [
'../_base_/models/poolformer/poolformer_m36.py',
'../_base_/datasets/imagenet_bs128_poolformer_medium_224.py',
'../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../_base_/default_runtime.py',
]
optimizer = dict(lr=4e-3)
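The `32xb128` in the config name encodes the distributed setup (32 GPUs x 128 samples per GPU, a global batch of 4096), and the `lr=4e-3` override scales the inherited base rate to that batch size. A hedged sketch of the linear scaling rule; the base values are assumptions for illustration and the real ones live in the inherited schedule file:

```python
# Sketch: linear learning-rate scaling with global batch size.
def scale_lr(base_lr, base_batch, gpus, samples_per_gpu):
    global_batch = gpus * samples_per_gpu
    return base_lr * global_batch / base_batch

print(scale_lr(base_lr=1e-3, base_batch=1024, gpus=32, samples_per_gpu=128))  # 0.004
```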