Commit 0fd8347d (parent cc567e9e)

Add mmclassification-0.24.1 code, remove mmclassification-speed-benchmark
_base_ = 'swin-tiny_16xb64_in1k.py'
_deprecation_ = dict(
expected='swin-tiny_16xb64_in1k.py',
reference='https://github.com/open-mmlab/mmclassification/pull/508',
)
# Swin Transformer V2
> [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883)
<!-- [ALGORITHM] -->
## Abstract
Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google's billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.
<div align=center>
<img src="https://user-images.githubusercontent.com/42952108/180748696-ee7ed23d-7fee-4ccf-9eb5-f117db228a42.png" width="100%"/>
</div>
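The two attention-related changes summarized in the abstract can be written compactly. The block below is only a summary of the formulas, with notation following the paper; it is not the mmcls implementation.

```latex
% Scaled cosine attention: similarity between query i and key j, with a
% learnable (per-head, per-layer) temperature tau and relative position bias B_ij.
\mathrm{Sim}(\mathbf{q}_i, \mathbf{k}_j) = \frac{\cos(\mathbf{q}_i, \mathbf{k}_j)}{\tau} + B_{ij}

% Log-spaced continuous position bias: relative coordinates are rescaled on a
% log scale before a small MLP generates B_ij, which eases transfer between
% the pre-training and fine-tuning window sizes.
\widehat{\Delta x} = \operatorname{sign}(\Delta x) \cdot \log\left(1 + |\Delta x|\right), \qquad
\widehat{\Delta y} = \operatorname{sign}(\Delta y) \cdot \log\left(1 + |\Delta y|\right)
```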
## Results and models
### ImageNet-21k
The pre-trained models on ImageNet-21k are only used for fine-tuning and therefore don't have evaluation results; a fine-tuning sketch follows the table below.
| Model | resolution | Params(M) | Flops(G) | Download |
| :------: | :--------: | :-------: | :------: | :--------------------------------------------------------------------------------------------------------------------------------------: |
| Swin-B\* | 192x192 | 87.92 | 8.51 | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-base-w12_3rdparty_in21k-192px_20220803-f7dc9763.pth) |
| Swin-L\* | 192x192 | 196.74 | 19.04 | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-large-w12_3rdparty_in21k-192px_20220803-d9073fee.pth) |
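One possible way to fine-tune from these checkpoints is to point the backbone's `init_cfg` at the downloaded weights, as in the sketch below. This is only an illustration, assuming the converted checkpoint uses the standard mmcls `backbone.` key prefix; the ImageNet-1k models listed in the next table were obtained by converting the officially fine-tuned weights, not by running this snippet.

```python
# Sketch of a fine-tuning config that initializes the backbone from the
# ImageNet-21k pre-trained checkpoint above. Paths and settings are
# illustrative, not the settings used for the released models.
_base_ = [
    '../_base_/models/swin_transformer_v2/base_256.py',
    '../_base_/datasets/imagenet_bs64_swin_256.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py'
]

checkpoint = 'https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-base-w12_3rdparty_in21k-192px_20220803-f7dc9763.pth'  # noqa

model = dict(
    backbone=dict(
        window_size=[16, 16, 16, 8],
        pretrained_window_sizes=[12, 12, 12, 6],
        init_cfg=dict(type='Pretrained', checkpoint=checkpoint, prefix='backbone')))
```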
### ImageNet-1k
| Model | Pretrain | resolution | window | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :------: | :----------: | :--------: | :----: | :-------: | :------: | :-------: | :-------: | :-------------------------------------------------------------: | :----------------------------------------------------------------: |
| Swin-T\* | From scratch | 256x256 | 8x8 | 28.35 | 4.35 | 81.76 | 95.87 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w8_3rdparty_in1k-256px_20220803-e318968f.pth) |
| Swin-T\* | From scratch | 256x256 | 16x16 | 28.35 | 4.4 | 82.81 | 96.23 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w16_3rdparty_in1k-256px_20220803-9651cdd7.pth) |
| Swin-S\* | From scratch | 256x256 | 8x8 | 49.73 | 8.45 | 83.74 | 96.6 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w8_3rdparty_in1k-256px_20220803-b01a4332.pth) |
| Swin-S\* | From scratch | 256x256 | 16x16 | 49.73 | 8.57 | 84.13 | 96.83 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w16_3rdparty_in1k-256px_20220803-b707d206.pth) |
| Swin-B\* | From scratch | 256x256 | 8x8 | 87.92 | 14.99 | 84.2 | 96.86 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w8_3rdparty_in1k-256px_20220803-8ff28f2b.pth) |
| Swin-B\* | From scratch | 256x256 | 16x16 | 87.92 | 15.14 | 84.6 | 97.05 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_3rdparty_in1k-256px_20220803-5a1886b7.pth) |
| Swin-B\* | ImageNet-21k | 256x256 | 16x16 | 87.92 | 15.14 | 86.17 | 97.88 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_in21k-pre_3rdparty_in1k-256px_20220803-8d7aa8ad.pth) |
| Swin-B\* | ImageNet-21k | 384x384 | 24x24 | 87.92 | 34.07 | 87.14 | 98.23 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w24_in21k-pre_3rdparty_in1k-384px_20220803-44eb70f8.pth) |
| Swin-L\* | ImageNet-21k | 256x256 | 16x16 | 196.75 | 33.86 | 86.93 | 98.06 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w16_in21k-pre_3rdparty_in1k-256px_20220803-c40cbed7.pth) |
| Swin-L\* | ImageNet-21k | 384x384 | 24x24 | 196.75 | 76.2 | 87.59 | 98.27 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w24_in21k-pre_3rdparty_in1k-384px_20220803-3b36c165.pth) |
*Models with * are converted from the [official repo](https://github.com/microsoft/Swin-Transformer#main-results-on-imagenet-with-pretrained-models). The config files of these models are only for validation. We cannot guarantee that training with these configs reproduces the reported accuracy, and we welcome you to contribute your reproduction results.*
*The ImageNet-21k pre-trained models with input resolutions of 256x256 and 384x384 are both fine-tuned from the same pre-trained model, which uses a smaller input resolution of 192x192.*
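As a quick sanity check, the checkpoints above can be loaded with the MMClassification 0.x Python API. A minimal sketch follows; the paths are examples taken from the first row of the ImageNet-1k table, so substitute the config and weights you actually downloaded.

```python
from mmcls.apis import inference_model, init_model

# Example paths from the table above; replace with your local files.
config_file = 'configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py'
checkpoint_file = 'swinv2-tiny-w8_3rdparty_in1k-256px_20220803-e318968f.pth'

model = init_model(config_file, checkpoint_file, device='cpu')
result = inference_model(model, 'demo/demo.JPEG')  # any RGB image path works
print(result['pred_class'], result['pred_score'])
```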
## Citation
```
@article{liu2021swinv2,
doi = {10.48550/ARXIV.2111.09883},
url = {https://arxiv.org/abs/2111.09883},
author = {Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, Yue and Zhang, Zheng and Dong, Li and Wei, Furu and Guo, Baining},
keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
title = {Swin Transformer V2: Scaling Up Capacity and Resolution},
publisher = {arXiv},
year = {2021},
copyright = {Creative Commons Attribution 4.0 International}
}
```
Collections:
- Name: Swin-Transformer-V2
Metadata:
Training Data: ImageNet-1k
Training Techniques:
- AdamW
- Weight Decay
Training Resources: 16x V100 GPUs
Epochs: 300
Batch Size: 1024
Architecture:
- Shift Window Multihead Self Attention
Paper:
URL: https://arxiv.org/abs/2111.09883
Title: "Swin Transformer V2: Scaling Up Capacity and Resolution"
README: configs/swin_transformer_v2/README.md
Models:
- Name: swinv2-tiny-w8_3rdparty_in1k-256px
Metadata:
FLOPs: 4350000000
Parameters: 28350000
In Collection: Swin-Transformer-V2
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 81.76
Top 5 Accuracy: 95.87
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w8_3rdparty_in1k-256px_20220803-e318968f.pth
Config: configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py
Converted From:
Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_tiny_patch4_window8_256.pth
Code: https://github.com/microsoft/Swin-Transformer
- Name: swinv2-tiny-w16_3rdparty_in1k-256px
Metadata:
FLOPs: 4400000000
Parameters: 28350000
In Collection: Swin-Transformer-V2
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 82.81
Top 5 Accuracy: 96.23
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w16_3rdparty_in1k-256px_20220803-9651cdd7.pth
Config: configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py
Converted From:
Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_tiny_patch4_window16_256.pth
Code: https://github.com/microsoft/Swin-Transformer
- Name: swinv2-small-w8_3rdparty_in1k-256px
Metadata:
FLOPs: 8450000000
Parameters: 49730000
In Collection: Swin-Transformer-V2
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 83.74
Top 5 Accuracy: 96.6
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w8_3rdparty_in1k-256px_20220803-b01a4332.pth
Config: configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py
Converted From:
Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_small_patch4_window8_256.pth
Code: https://github.com/microsoft/Swin-Transformer
- Name: swinv2-small-w16_3rdparty_in1k-256px
Metadata:
FLOPs: 8570000000
Parameters: 49730000
In Collection: Swin-Transformer-V2
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 84.13
Top 5 Accuracy: 96.83
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w16_3rdparty_in1k-256px_20220803-b707d206.pth
Config: configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py
Converted From:
Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_small_patch4_window16_256.pth
Code: https://github.com/microsoft/Swin-Transformer
- Name: swinv2-base-w8_3rdparty_in1k-256px
Metadata:
FLOPs: 14990000000
Parameters: 87920000
In Collection: Swin-Transformer-V2
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 84.2
Top 5 Accuracy: 96.86
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w8_3rdparty_in1k-256px_20220803-8ff28f2b.pth
Config: configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py
Converted From:
Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window8_256.pth
Code: https://github.com/microsoft/Swin-Transformer
- Name: swinv2-base-w16_3rdparty_in1k-256px
Metadata:
FLOPs: 15140000000
Parameters: 87920000
In Collection: Swin-Transformer-V2
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 84.6
Top 5 Accuracy: 97.05
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_3rdparty_in1k-256px_20220803-5a1886b7.pth
Config: configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py
Converted From:
Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window16_256.pth
Code: https://github.com/microsoft/Swin-Transformer
- Name: swinv2-base-w16_in21k-pre_3rdparty_in1k-256px
Metadata:
Training Data: ImageNet-21k
FLOPs: 15140000000
Parameters: 87920000
In Collection: Swin-Transformer-V2
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 86.17
Top 5 Accuracy: 97.88
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_in21k-pre_3rdparty_in1k-256px_20220803-8d7aa8ad.pth
Config: configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py
Converted From:
Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12to16_192to256_22kto1k_ft.pth
Code: https://github.com/microsoft/Swin-Transformer
- Name: swinv2-base-w24_in21k-pre_3rdparty_in1k-384px
Metadata:
Training Data: ImageNet-21k
FLOPs: 34070000000
Parameters: 87920000
In Collection: Swin-Transformer-V2
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 87.14
Top 5 Accuracy: 98.23
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w24_in21k-pre_3rdparty_in1k-384px_20220803-44eb70f8.pth
Config: configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py
Converted From:
Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12to24_192to384_22kto1k_ft.pth
Code: https://github.com/microsoft/Swin-Transformer
- Name: swinv2-large-w16_in21k-pre_3rdparty_in1k-256px
Metadata:
Training Data: ImageNet-21k
FLOPs: 33860000000
Parameters: 196750000
In Collection: Swin-Transformer-V2
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 86.93
Top 5 Accuracy: 98.06
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w16_in21k-pre_3rdparty_in1k-256px_20220803-c40cbed7.pth
Config: configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py
Converted From:
Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12to16_192to256_22kto1k_ft.pth
Code: https://github.com/microsoft/Swin-Transformer
- Name: swinv2-large-w24_in21k-pre_3rdparty_in1k-384px
Metadata:
Training Data: ImageNet-21k
FLOPs: 76200000000
Parameters: 196750000
In Collection: Swin-Transformer-V2
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 87.59
Top 5 Accuracy: 98.27
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w24_in21k-pre_3rdparty_in1k-384px_20220803-3b36c165.pth
Config: configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py
Converted From:
Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12to24_192to384_22kto1k_ft.pth
Code: https://github.com/microsoft/Swin-Transformer
- Name: swinv2-base-w12_3rdparty_in21k-192px
Metadata:
Training Data: ImageNet-21k
FLOPs: 8510000000
Parameters: 87920000
In Collection: Swin-Transformer-V2
Results: null
Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-base-w12_3rdparty_in21k-192px_20220803-f7dc9763.pth
Converted From:
Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12_192_22k.pth
Code: https://github.com/microsoft/Swin-Transformer
- Name: swinv2-large-w12_3rdparty_in21k-192px
Metadata:
Training Data: ImageNet-21k
FLOPs: 19040000000
Parameters: 196740000
In Collection: Swin-Transformer-V2
Results: null
Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-large-w12_3rdparty_in21k-192px_20220803-d9073fee.pth
Converted From:
Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12_192_22k.pth
Code: https://github.com/microsoft/Swin-Transformer
_base_ = [
'../_base_/models/swin_transformer_v2/base_256.py',
'../_base_/datasets/imagenet_bs64_swin_256.py',
'../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../_base_/default_runtime.py'
]
model = dict(backbone=dict(window_size=[16, 16, 16, 8]))
_base_ = [
'../_base_/models/swin_transformer_v2/base_256.py',
'../_base_/datasets/imagenet_bs64_swin_256.py',
'../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../_base_/default_runtime.py'
]
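# `window_size` sets the attention window per stage, while
# `pretrained_window_sizes` records the windows used during the 192px
# ImageNet-21k pre-training so that the log-spaced continuous position bias
# can be re-scaled to the larger fine-tuning windows (see the Swin V2 paper).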
model = dict(
type='ImageClassifier',
backbone=dict(
window_size=[16, 16, 16, 8],
drop_path_rate=0.2,
pretrained_window_sizes=[12, 12, 12, 6]))
_base_ = [
'../_base_/models/swin_transformer_v2/base_384.py',
'../_base_/datasets/imagenet_bs64_swin_384.py',
'../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
img_size=384,
window_size=[24, 24, 24, 12],
drop_path_rate=0.2,
pretrained_window_sizes=[12, 12, 12, 6]))
_base_ = [
'../_base_/models/swin_transformer_v2/base_256.py',
'../_base_/datasets/imagenet_bs64_swin_256.py',
'../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../_base_/default_runtime.py'
]
# Only for evaluation
_base_ = [
'../_base_/models/swin_transformer_v2/large_256.py',
'../_base_/datasets/imagenet_bs64_swin_256.py',
'../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
window_size=[16, 16, 16, 8], pretrained_window_sizes=[12, 12, 12, 6]),
)
# Only for evaluation
_base_ = [
'../_base_/models/swin_transformer_v2/large_384.py',
'../_base_/datasets/imagenet_bs64_swin_384.py',
'../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
img_size=384,
window_size=[24, 24, 24, 12],
pretrained_window_sizes=[12, 12, 12, 6]),
)
_base_ = [
'../_base_/models/swin_transformer_v2/small_256.py',
'../_base_/datasets/imagenet_bs64_swin_256.py',
'../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../_base_/default_runtime.py'
]
model = dict(backbone=dict(window_size=[16, 16, 16, 8]))
_base_ = [
'../_base_/models/swin_transformer_v2/small_256.py',
'../_base_/datasets/imagenet_bs64_swin_256.py',
'../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../_base_/default_runtime.py'
]
_base_ = [
'../_base_/models/swin_transformer_v2/tiny_256.py',
'../_base_/datasets/imagenet_bs64_swin_256.py',
'../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../_base_/default_runtime.py'
]
model = dict(backbone=dict(window_size=[16, 16, 16, 8]))
_base_ = [
'../_base_/models/swin_transformer_v2/tiny_256.py',
'../_base_/datasets/imagenet_bs64_swin_256.py',
'../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../_base_/default_runtime.py'
]
# Tokens-to-Token ViT
> [Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet](https://arxiv.org/abs/2101.11986)
<!-- [ALGORITHM] -->
## Abstract
Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3% top1 accuracy in image resolution 384×384 on ImageNet.
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142578381-e9040610-05d9-457c-8bf5-01c2fa94add2.png" width="60%"/>
</div>
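To make the layer-wise Tokens-to-Token transformation concrete, here is a minimal PyTorch sketch of a single T2T step: reshape the tokens back to a feature map, aggregate each local neighborhood into one longer token via a soft split, then model the new, shorter token sequence. The class name `T2TStep` is made up for illustration and this is not the mmcls T2T-ViT backbone implementation.

```python
import torch
import torch.nn as nn


class T2TStep(nn.Module):
    """One Tokens-to-Token step: soft split + attention over the new tokens."""

    def __init__(self, dim, kernel=3, stride=2, padding=1):
        super().__init__()
        # nn.Unfold gathers each k x k neighborhood into a single column.
        self.unfold = nn.Unfold(kernel_size=kernel, stride=stride, padding=padding)
        new_dim = dim * kernel * kernel
        self.block = nn.TransformerEncoderLayer(
            d_model=new_dim, nhead=1, dim_feedforward=new_dim, batch_first=True)

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, C) -> feature map (B, C, h, w)
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        # soft split: each neighborhood of tokens becomes one longer token
        x = self.unfold(x)           # (B, C*k*k, L), with L < h*w
        x = x.transpose(1, 2)        # (B, L, C*k*k)
        return self.block(x)         # model relations among the new tokens


tokens = torch.randn(1, 56 * 56, 64)
out = T2TStep(64)(tokens, 56, 56)    # -> (1, 28*28, 576): fewer, richer tokens
print(out.shape)
```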
## Results and models
### ImageNet-1k
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :----------: | :-------: | :------: | :-------: | :-------: | :-------------------------------------------------------------------------: | :----------------------------------------------------------------------------: |
| T2T-ViT_t-14 | 21.47 | 4.34 | 81.83 | 95.84 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.log.json) |
| T2T-ViT_t-19 | 39.08 | 7.80 | 82.63 | 96.18 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-19_8xb64_in1k_20211214-7f5e3aaf.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-19_8xb64_in1k_20211214-7f5e3aaf.log.json) |
| T2T-ViT_t-24 | 64.00 | 12.69 | 82.71 | 96.09 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-24_8xb64_in1k_20211214-b2a68ae3.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-24_8xb64_in1k_20211214-b2a68ae3.log.json) |
*Consistent with the [official repo](https://github.com/yitu-opensource/T2T-ViT), we adopt the best checkpoints during training.*
## Citation
```
@article{yuan2021tokens,
title={Tokens-to-token vit: Training vision transformers from scratch on imagenet},
author={Yuan, Li and Chen, Yunpeng and Wang, Tao and Yu, Weihao and Shi, Yujun and Tay, Francis EH and Feng, Jiashi and Yan, Shuicheng},
journal={arXiv preprint arXiv:2101.11986},
year={2021}
}
```
Collections:
- Name: Tokens-to-Token ViT
Metadata:
Training Data: ImageNet-1k
Architecture:
- Layer Normalization
- Scaled Dot-Product Attention
- Attention Dropout
- Dropout
- Tokens to Token
Paper:
URL: https://arxiv.org/abs/2101.11986
Title: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet"
README: configs/t2t_vit/README.md
Code:
URL: https://github.com/open-mmlab/mmclassification/blob/v0.17.0/mmcls/models/backbones/t2t_vit.py
Version: v0.17.0
Models:
- Name: t2t-vit-t-14_8xb64_in1k
Metadata:
FLOPs: 4340000000
Parameters: 21470000
In Collection: Tokens-to-Token ViT
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 81.83
Top 5 Accuracy: 95.84
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.pth
Config: configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py
- Name: t2t-vit-t-19_8xb64_in1k
Metadata:
FLOPs: 7800000000
Parameters: 39080000
In Collection: Tokens-to-Token ViT
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 82.63
Top 5 Accuracy: 96.18
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-19_8xb64_in1k_20211214-7f5e3aaf.pth
Config: configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py
- Name: t2t-vit-t-24_8xb64_in1k
Metadata:
FLOPs: 12690000000
Parameters: 64000000
In Collection: Tokens-to-Token ViT
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 82.71
Top 5 Accuracy: 96.09
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-24_8xb64_in1k_20211214-b2a68ae3.pth
Config: configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py
_base_ = [
'../_base_/models/t2t-vit-t-14.py',
'../_base_/datasets/imagenet_bs64_t2t_224.py',
'../_base_/default_runtime.py',
]
# optimizer
paramwise_cfg = dict(
norm_decay_mult=0.0,
bias_decay_mult=0.0,
custom_keys={'cls_token': dict(decay_mult=0.0)},
)
optimizer = dict(
type='AdamW',
lr=5e-4,
weight_decay=0.05,
paramwise_cfg=paramwise_cfg,
)
optimizer_config = dict(grad_clip=None)
# learning policy
# FIXME: lr in the first 300 epochs follows the cosine annealing schedule, and
# the lr in the last 10 epochs equals min_lr
lr_config = dict(
policy='CosineAnnealingCooldown',
min_lr=1e-5,
cool_down_time=10,
cool_down_ratio=0.1,
by_epoch=True,
warmup_by_epoch=True,
warmup='linear',
warmup_iters=10,
warmup_ratio=1e-6)
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
runner = dict(type='EpochBasedRunner', max_epochs=310)
_base_ = [
'../_base_/models/t2t-vit-t-19.py',
'../_base_/datasets/imagenet_bs64_t2t_224.py',
'../_base_/default_runtime.py',
]
# optimizer
paramwise_cfg = dict(
norm_decay_mult=0.0,
bias_decay_mult=0.0,
custom_keys={'cls_token': dict(decay_mult=0.0)},
)
optimizer = dict(
type='AdamW',
lr=5e-4,
weight_decay=0.065,
paramwise_cfg=paramwise_cfg,
)
optimizer_config = dict(grad_clip=None)
# learning policy
# FIXME: lr in the first 300 epochs follows the cosine annealing schedule, and
# the lr in the last 10 epochs equals min_lr
lr_config = dict(
policy='CosineAnnealingCooldown',
min_lr=1e-5,
cool_down_time=10,
cool_down_ratio=0.1,
by_epoch=True,
warmup_by_epoch=True,
warmup='linear',
warmup_iters=10,
warmup_ratio=1e-6)
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
runner = dict(type='EpochBasedRunner', max_epochs=310)
_base_ = [
'../_base_/models/t2t-vit-t-24.py',
'../_base_/datasets/imagenet_bs64_t2t_224.py',
'../_base_/default_runtime.py',
]
# optimizer
paramwise_cfg = dict(
norm_decay_mult=0.0,
bias_decay_mult=0.0,
custom_keys={'cls_token': dict(decay_mult=0.0)},
)
optimizer = dict(
type='AdamW',
lr=5e-4,
weight_decay=0.065,
paramwise_cfg=paramwise_cfg,
)
optimizer_config = dict(grad_clip=None)
# learning policy
# FIXME: lr in the first 300 epochs follows the cosine annealing schedule, and
# the lr in the last 10 epochs equals min_lr
lr_config = dict(
policy='CosineAnnealingCooldown',
min_lr=1e-5,
cool_down_time=10,
cool_down_ratio=0.1,
by_epoch=True,
warmup_by_epoch=True,
warmup='linear',
warmup_iters=10,
warmup_ratio=1e-6)
custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
runner = dict(type='EpochBasedRunner', max_epochs=310)
# TNT
> [Transformer in Transformer](https://arxiv.org/abs/2103.00112)
<!-- [ALGORITHM] -->
## Abstract
Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4×4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost.
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142578661-298d92a1-2e25-4910-a312-085587be6b65.png" width="80%"/>
</div>
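A minimal PyTorch sketch of the idea follows: inner attention over the "visual words" inside each patch, outer attention over the patch-level "visual sentences", with the word features folded back into the sentence tokens. The class `TNTBlockSketch` and its sizes are illustrative only, not the mmcls `TNT` backbone.

```python
import torch
import torch.nn as nn


class TNTBlockSketch(nn.Module):
    def __init__(self, word_dim=24, sent_dim=384, words_per_patch=16):
        super().__init__()
        # inner transformer: attention among the words of one patch
        self.inner = nn.TransformerEncoderLayer(word_dim, nhead=4, batch_first=True)
        # outer transformer: attention among the patch-level sentence tokens
        self.outer = nn.TransformerEncoderLayer(sent_dim, nhead=6, batch_first=True)
        # fold the flattened word features of a patch back into its sentence token
        self.proj = nn.Linear(word_dim * words_per_patch, sent_dim)

    def forward(self, words, sentences):
        # words: (B*P, W, word_dim)   -- W visual words per patch
        # sentences: (B, P, sent_dim) -- one token per patch
        words = self.inner(words)
        b, p, _ = sentences.shape
        sentences = sentences + self.proj(words.reshape(b, p, -1))
        return words, self.outer(sentences)


words = torch.randn(2 * 196, 16, 24)   # 196 patches per image, 16 words each
sents = torch.randn(2, 196, 384)
w, s = TNTBlockSketch()(words, sents)
print(w.shape, s.shape)
```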
## Results and models
### ImageNet-1k
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------: | :-------: | :------: | :-------: | :-------: | :--------------------------------------------------------------------------: | :----------------------------------------------------------------------------: |
| TNT-small\* | 23.76 | 3.36 | 81.52 | 95.73 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/tnt/tnt-s-p16_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tnt/tnt-small-p16_3rdparty_in1k_20210903-c56ee7df.pth) |
*Models with * are converted from [timm](https://github.com/rwightman/pytorch-image-models/). The config files of these models are only for validation. We cannot guarantee that training with these configs reproduces the reported accuracy, and we welcome you to contribute your reproduction results.*
## Citation
```
@misc{han2021transformer,
title={Transformer in Transformer},
author={Kai Han and An Xiao and Enhua Wu and Jianyuan Guo and Chunjing Xu and Yunhe Wang},
year={2021},
eprint={2103.00112},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
Collections:
- Name: Transformer in Transformer
Metadata:
Training Data: ImageNet-1k
Paper:
URL: https://arxiv.org/abs/2103.00112
Title: "Transformer in Transformer"
README: configs/tnt/README.md
Code:
URL: https://github.com/open-mmlab/mmclassification/blob/v0.15.0/mmcls/models/backbones/tnt.py#L203
Version: v0.15.0
Models:
- Name: tnt-small-p16_3rdparty_in1k
Metadata:
FLOPs: 3360000000
Parameters: 23760000
In Collection: Transformer in Transformer
Results:
- Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 81.52
Top 5 Accuracy: 95.73
Task: Image Classification
Weights: https://download.openmmlab.com/mmclassification/v0/tnt/tnt-small-p16_3rdparty_in1k_20210903-c56ee7df.pth
Config: configs/tnt/tnt-s-p16_16xb64_in1k.py
Converted From:
Weights: https://github.com/contrastive/pytorch-image-models/releases/download/TNT/tnt_s_patch16_224.pth.tar
Code: https://github.com/contrastive/pytorch-image-models/blob/809271b0f3e5d9be4e11c0c5cec1dbba8b5e2c60/timm/models/tnt.py#L144