# optimizer wrapper
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=1e-4))
# learning rate scheduler
param_scheduler = [
dict(type='MultiStepLR', by_epoch=True, milestones=[60, 80], gamma=0.1)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
val_cfg = dict()
test_cfg = dict()
# ArcFace
> [ArcFace: Additive Angular Margin Loss for Deep Face Recognition](https://arxiv.org/abs/1801.07698)
<!-- [ALGORITHM] -->
## Abstract
Recently, a popular line of research in face recognition is adopting margins in the well-established softmax loss function to maximize class separability. In this paper, we first introduce an Additive Angular Margin Loss (ArcFace), which not only has a clear geometric interpretation but also significantly enhances the discriminative power. Since ArcFace is susceptible to the massive label noise, we further propose sub-center ArcFace, in which each class contains K sub-centers and training samples only need to be close to any of the K positive sub-centers. Sub-center ArcFace encourages one dominant sub-class that contains the majority of clean faces and non-dominant sub-classes that include hard or noisy faces. Based on this self-propelled isolation, we boost the performance through automatically purifying raw web faces under massive real-world noise. Besides discriminative feature embedding, we also explore the inverse problem, mapping feature vectors to face images. Without training any additional generator or discriminator, the pre-trained ArcFace model can generate identity-preserved face images for both subjects inside and outside the training data only by using the network gradient and Batch Normalization (BN) priors. Extensive experiments demonstrate that ArcFace can enhance the discriminative feature embedding as well as strengthen the generative face synthesis.
<div align=center>
<img src="https://user-images.githubusercontent.com/24734142/212606212-8ffc3cd2-dbc1-4abf-8924-22167f3f6e34.png" width="80%"/>
</div>
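For readers who want the margin idea in code, below is a minimal PyTorch sketch of the additive angular margin logits (an illustration only, not the `ArcFaceClsHead` implementation used by the configs; the function name and the default scale `s=64.0` and margin `m=0.5` are assumptions based on common practice):

```python
import torch
import torch.nn.functional as F

def arcface_logits(features, weight, labels, s=64.0, m=0.5):
    """Sketch of ArcFace: add an angular margin to the target-class logit.

    features: (N, C) embeddings, weight: (num_classes, C) class centers.
    """
    # cosine similarity between L2-normalized embeddings and class centers
    cosine = F.linear(F.normalize(features), F.normalize(weight))
    theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
    # apply the additive margin m only to the ground-truth class, then rescale
    one_hot = F.one_hot(labels, num_classes=weight.size(0)).bool()
    logits = torch.where(one_hot, torch.cos(theta + m), cosine)
    return s * logits  # feed into a standard cross-entropy loss
```

The resulting logits are consumed by an ordinary softmax cross-entropy, which is why the config below pairs `ArcFaceClsHead` with `CrossEntropyLoss`.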
## How to use it?
<!-- [TABS-BEGIN] -->
**Retrieve image**
```python
from mmpretrain import ImageRetrievalInferencer
inferencer = ImageRetrievalInferencer('resnet50-arcface_inshop', prototype='demo/')
predict = inferencer('demo/dog.jpg', topk=2)[0]
print(predict[0])
print(predict[1])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('resnet50-arcface_inshop', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/arcface/resnet50-arcface_8xb32_inshop.py
```
Test:
```shell
python tools/test.py configs/arcface/resnet50-arcface_8xb32_inshop.py https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Retrieval on InShop
| Model | Pretrain | Params(M) | Flops(G) | Recall@1 | mAP@10 | Config | Download |
| :-----------------------: | :------------------------------------------------: | :-------: | :------: | :------: | :----: | :------------------------------------------: | :------------------------------------------------: |
| `resnet50-arcface_inshop` | [ImageNet-21k-mill](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth) | 31.69 | 16.48 | 90.18 | 69.30 | [config](./resnet50-arcface_8xb32_inshop.py) | [model](https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.log) |
## Citation
```bibtex
@inproceedings{deng2018arcface,
title={ArcFace: Additive Angular Margin Loss for Deep Face Recognition},
author={Deng, Jiankang and Guo, Jia and Xue, Niannan and Zafeiriou, Stefanos},
booktitle={CVPR},
year={2019}
}
```
Collections:
- Name: ArcFace
Metadata:
Training Data: InShop
Architecture:
- Additive Angular Margin Loss
Paper:
URL: https://arxiv.org/abs/1801.07698
Title: 'ArcFace: Additive Angular Margin Loss for Deep Face Recognition'
README: configs/arcface/README.md
Code:
Version: v1.0.0rc3
URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc3/mmcls/models/heads/margin_head.py
Models:
- Name: resnet50-arcface_inshop
Metadata:
FLOPs: 16571226112
Parameters: 31693888
In Collection: ArcFace
Results:
- Dataset: InShop
Metrics:
Recall@1: 90.18
mAP@10: 69.30
Task: Image Retrieval
Weights: https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth
Config: configs/arcface/resnet50-arcface_8xb32_inshop.py
_base_ = [
'../_base_/datasets/inshop_bs32_448.py',
'../_base_/schedules/cub_bs64.py',
'../_base_/default_runtime.py',
]
pretrained = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth' # noqa
model = dict(
type='ImageToImageRetriever',
image_encoder=[
dict(
type='ResNet',
depth=50,
init_cfg=dict(
type='Pretrained', checkpoint=pretrained, prefix='backbone')),
dict(type='GlobalAveragePooling'),
],
head=dict(
type='ArcFaceClsHead',
num_classes=3997,
in_channels=2048,
loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
init_cfg=None),
prototype={{_base_.gallery_dataloader}})
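# The prototype points at the gallery dataloader from the base dataset config;
# the retriever encodes it into a feature bank that query images are matched
# against, and `PrepareProtoBeforeValLoopHook` (see custom_hooks below)
# rebuilds this bank before each validation loop.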
# runtime settings
default_hooks = dict(
# log every 20 iterations
logger=dict(type='LoggerHook', interval=20),
# save last three checkpoints
checkpoint=dict(
type='CheckpointHook',
save_best='auto',
interval=1,
max_keep_ckpts=3,
rule='greater'))
# optimizer
optim_wrapper = dict(
optimizer=dict(
type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0005, nesterov=True))
# learning policy
param_scheduler = [
# warm up learning rate scheduler
dict(
type='LinearLR',
start_factor=0.01,
by_epoch=True,
begin=0,
end=5,
# update by iter
convert_to_iter_based=True),
# main learning rate scheduler
dict(
type='CosineAnnealingLR',
T_max=45,
by_epoch=True,
begin=5,
end=50,
)
]
train_cfg = dict(by_epoch=True, max_epochs=50, val_interval=1)
auto_scale_lr = dict(enable=True, base_batch_size=256)
custom_hooks = [
dict(type='PrepareProtoBeforeValLoopHook'),
dict(type='SyncBuffersHook')
]
# BarlowTwins
> [Barlow Twins: Self-Supervised Learning via Redundancy Reduction](https://arxiv.org/abs/2103.03230)
<!-- [ALGORITHM] -->
## Abstract
Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection.
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/163914714-082de804-0b5f-4024-94f9-880e6ef334fa.png" width="800" />
</div>
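As a concrete reference for the objective described above, here is a minimal sketch of the Barlow Twins loss on two batches of embeddings (an illustration, not the `CrossCorrelationLoss` module from the config; the off-diagonal weight `lambda_=0.005` follows the paper but is an assumption here):

```python
import torch

def barlow_twins_loss(z1, z2, lambda_=0.005):
    """Sketch: push the cross-correlation of two views toward identity.

    z1, z2: (N, D) embeddings of two distorted views of the same batch.
    """
    n, _ = z1.shape
    # standardize each embedding dimension over the batch
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = z1.T @ z2 / n  # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()              # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy term
    return on_diag + lambda_ * off_diag
```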
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('barlowtwins_resnet50_8xb256-coslr-300e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py
```
Test:
```shell
python tools/test.py configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :-------------------------------------------- | :--------: | :-------: | :------------------------------------------------------: | :------------------------------------------------------------------------------: |
| `barlowtwins_resnet50_8xb256-coslr-300e_in1k` | 174.54 | 4.11 | [config](barlowtwins_resnet50_8xb256-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
| `resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k` | [BARLOWTWINS](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth) | 25.56 | 4.11 | 71.80 | [config](benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.json) |
## Citation
```bibtex
@inproceedings{zbontar2021barlow,
title={Barlow twins: Self-supervised learning via redundancy reduction},
author={Zbontar, Jure and Jing, Li and Misra, Ishan and LeCun, Yann and Deny, St{\'e}phane},
booktitle={International Conference on Machine Learning},
year={2021},
}
```
_base_ = [
'../_base_/datasets/imagenet_bs32_byol.py',
'../_base_/default_runtime.py',
]
# datasets
train_dataloader = dict(batch_size=256)
# model settings
model = dict(
type='BarlowTwins',
backbone=dict(
type='ResNet',
depth=50,
norm_cfg=dict(type='SyncBN'),
zero_init_residual=True),
neck=dict(
type='NonLinearNeck',
in_channels=2048,
hid_channels=8192,
out_channels=8192,
num_layers=3,
with_last_bn=False,
with_last_bn_affine=False,
with_avg_pool=True,
init_cfg=dict(
type='Kaiming', distribution='uniform', layer=['Linear'])),
head=dict(
type='LatentCrossCorrelationHead',
in_channels=8192,
loss=dict(type='CrossCorrelationLoss')))
# optimizer
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=1e-6),
paramwise_cfg=dict(
custom_keys={
'bn': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
'bias': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
# bn layer in ResNet block downsample module
'downsample.1': dict(
decay_mult=0, lr_mult=0.024, lars_exclude=True),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1.6e-4,
by_epoch=True,
begin=0,
end=10,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=990,
eta_min=0.0016,
by_epoch=True,
begin=10,
end=1000,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1000)
default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=2048)
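# For example, when training is launched with the `--auto-scale-lr` flag, the
# LARS lr above is scaled linearly by (total batch size / 2048); with the
# default 8 GPUs x 256 images per GPU the factor is 1 and lr remains 1.6.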
_base_ = [
'../_base_/datasets/imagenet_bs32_byol.py',
'../_base_/default_runtime.py',
]
# datasets
train_dataloader = dict(batch_size=256)
# model settings
model = dict(
type='BarlowTwins',
backbone=dict(
type='ResNet',
depth=50,
norm_cfg=dict(type='SyncBN'),
zero_init_residual=True),
neck=dict(
type='NonLinearNeck',
in_channels=2048,
hid_channels=8192,
out_channels=8192,
num_layers=3,
with_last_bn=False,
with_last_bn_affine=False,
with_avg_pool=True,
init_cfg=dict(
type='Kaiming', distribution='uniform', layer=['Linear'])),
head=dict(
type='LatentCrossCorrelationHead',
in_channels=8192,
loss=dict(type='CrossCorrelationLoss')))
# optimizer
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=1e-6),
paramwise_cfg=dict(
custom_keys={
'bn': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
'bias': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
# bn layer in ResNet block downsample module
'downsample.1': dict(
decay_mult=0, lr_mult=0.024, lars_exclude=True),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1.6e-4,
by_epoch=True,
begin=0,
end=10,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=290,
eta_min=0.0016,
by_epoch=True,
begin=10,
end=300,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=2048)
_base_ = [
'../../_base_/models/resnet50.py',
'../../_base_/datasets/imagenet_bs32_pil_resize.py',
'../../_base_/schedules/imagenet_sgd_coslr_100e.py',
'../../_base_/default_runtime.py',
]
model = dict(
backbone=dict(
frozen_stages=4,
init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
# runtime settings
default_hooks = dict(
checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
Collections:
- Name: BarlowTwins
Metadata:
Training Data: ImageNet-1k
Training Techniques:
- LARS
Training Resources: 8x A100 GPUs
Architecture:
- ResNet
- BarlowTwins
Paper:
Title: 'Barlow Twins: Self-Supervised Learning via Redundancy Reduction'
URL: https://arxiv.org/abs/2103.03230
README: configs/barlowtwins/README.md
Models:
- Name: barlowtwins_resnet50_8xb256-coslr-300e_in1k
Metadata:
Epochs: 300
Batch Size: 2048
FLOPs: 4109364224
Parameters: 174535744
Training Data: ImageNet-1k
In Collection: BarlowTwins
Results: null
Weights: https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth
Config: configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py
Downstream:
- resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k
- Name: resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 256
FLOPs: 4109464576
Parameters: 25557032
Training Data: ImageNet-1k
In Collection: BarlowTwins
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 71.8
Weights: https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth
Config: configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py
# BEiT
> [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254)
<!-- [ALGORITHM] -->
## Abstract
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/203688351-adac7146-4e71-4ab6-8958-5cfe643a2dc5.png" width="70%"/>
</div>
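To make the masked-prediction objective concrete, here is a minimal sketch of the BEiT pre-training loss: recover the tokenizer's discrete visual tokens at the masked patch positions (an illustration, not the `BEiTV1Head` code; tensor names and shapes are assumptions):

```python
import torch.nn.functional as F

def beit_pretrain_loss(patch_logits, visual_tokens, mask):
    """Sketch of BEiT masked image modeling.

    patch_logits:  (N, num_patches, vocab_size) predictions from the encoder head.
    visual_tokens: (N, num_patches) discrete ids from the frozen image tokenizer.
    mask:          (N, num_patches) bool, True where the patch was masked.
    """
    # cross-entropy is computed only over the masked positions
    return F.cross_entropy(patch_logits[mask], visual_tokens[mask])
```

In the pre-training config below, the `BEiTMaskGenerator` in the data pipeline produces the mask, and the DALL-E `target_generator` produces the visual tokens.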
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('beit-base-p16_beit-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('beit_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
```
Test:
```shell
python tools/test.py configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :---------------------------------------------- | :--------: | :-------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------: |
| `beit_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 86.53 | 17.58 | [config](beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: |
| `beit-base-p16_beit-pre_8xb128-coslr-100e_in1k` | [BEIT](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth) | 86.53 | 17.58 | 83.10 | N/A | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.json) |
| `beit-base-p16_beit-in21k-pre_3rdparty_in1k`\* | BEIT ImageNet-21k | 86.53 | 17.58 | 85.28 | 97.59 | [config](benchmarks/beit-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth) |
*Models with * are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@inproceedings{bao2022beit,
title={{BE}iT: {BERT} Pre-Training of Image Transformers},
author={Hangbo Bao and Li Dong and Songhao Piao and Furu Wei},
booktitle={International Conference on Learning Representations},
year={2022},
}
```
_base_ = '../_base_/default_runtime.py'
# dataset settings
dataset_type = 'ImageNet'
data_root = 'data/imagenet/'
data_preprocessor = dict(
type='TwoNormDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
second_mean=[-31.875, -31.875, -31.875],
second_std=[318.75, 318.75, 318.75],
to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ColorJitter',
brightness=0.4,
contrast=0.4,
saturation=0.4,
hue=0.),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandomResizedCropAndInterpolationWithTwoPic',
size=224,
second_size=112,
interpolation='bicubic',
second_interpolation='lanczos',
scale=(0.08, 1.0)),
dict(
type='BEiTMaskGenerator',
input_size=(14, 14),
num_masking_patches=75,
max_num_patches=None,
min_num_patches=16),
dict(type='PackInputs')
]
train_dataloader = dict(
batch_size=256,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
collate_fn=dict(type='default_collate'),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='meta/train.txt',
data_prefix=dict(img_path='train/'),
pipeline=train_pipeline))
# model settings
model = dict(
type='BEiT',
backbone=dict(
type='BEiTPretrainViT',
arch='base',
patch_size=16,
drop_path_rate=0.1,
final_norm=True,
out_type='raw',
layer_scale_init_value=0.1,
init_cfg=[
dict(type='TruncNormal', std=0.02, layer='Linear'),
dict(type='TruncNormal', std=0.02, layer='Conv2d'),
dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
]),
neck=None,
head=dict(
type='BEiTV1Head',
embed_dims=768,
num_embed=8192,
loss=dict(type='CrossEntropyLoss')),
target_generator=dict(
type='DALL-E',
init_cfg=dict(
type='Pretrained',
checkpoint= # noqa: E251
'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/dalle_encoder.pth', # noqa: E501
)))
# optimizer wrapper
optim_wrapper = dict(
type='AmpOptimWrapper',
loss_scale='dynamic',
optimizer=dict(
type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05),
clip_grad=dict(max_norm=3.0),
paramwise_cfg=dict(
custom_keys={
# the following configurations are designed for BEiT
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'q_bias': dict(decay_mult=0.0),
'v_bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0),
'.gamma': dict(decay_mult=0.0),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=10,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
eta_min=1e-5,
by_epoch=True,
begin=10,
end=300,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
default_hooks = dict(
# only keeps the latest 3 checkpoints
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
randomness = dict(seed=0, diff_rank_seed=True)
find_unused_parameters = True
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=2048)
_base_ = [
'../../_base_/datasets/imagenet_bs64_swin_224.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
data_preprocessor = dict(
num_classes=1000,
mean=[127.5, 127.5, 127.5],
std=[127.5, 127.5, 127.5],
to_rgb=True,
)
# model settings
model = dict(
type='ImageClassifier',
backbone=dict(
type='BEiTViT',
arch='base',
img_size=224,
patch_size=16,
drop_path_rate=0.1,
out_type='avg_featmap',
use_abs_pos_emb=False,
use_rel_pos_bias=True,
use_shared_rel_pos_bias=False,
init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='RandomResizedCrop',
scale=224,
backend='pillow',
interpolation='bicubic'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandAugment',
policies='timm_increasing',
num_policies=2,
total_level=10,
magnitude_level=9,
magnitude_std=0.5,
hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
dict(
type='RandomErasing',
erase_prob=0.25,
mode='rand',
min_area_ratio=0.02,
max_area_ratio=0.3333333333333333,
fill_color=[103.53, 116.28, 123.675],
fill_std=[57.375, 57.12, 58.395]),
dict(type='PackInputs')
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=256,
edge='short',
backend='pillow',
interpolation='bicubic'),
dict(type='CenterCrop', crop_size=224),
dict(type='PackInputs')
]
train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
# optimizer wrapper
optim_wrapper = dict(
optimizer=dict(
type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)),
constructor='LearningRateDecayOptimWrapperConstructor',
paramwise_cfg=dict(
_delete_=True,
layer_decay_rate=0.65,
custom_keys={
# the following configurations are designed for BEiT
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'q_bias': dict(decay_mult=0.0),
'v_bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0),
'.gamma': dict(decay_mult=0.0),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=20,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
by_epoch=True,
begin=20,
end=100,
eta_min=1e-6,
convert_to_iter_based=True)
]
# runtime settings
default_hooks = dict(
# save checkpoint per epoch.
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
train_cfg = dict(by_epoch=True, max_epochs=100)
randomness = dict(seed=0)
_base_ = [
'../../_base_/datasets/imagenet_bs64_swin_224.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
data_preprocessor = dict(
num_classes=1000,
# RGB format normalization parameters
mean=[127.5, 127.5, 127.5],
std=[127.5, 127.5, 127.5],
# convert image from BGR to RGB
to_rgb=True,
)
model = dict(
type='ImageClassifier',
backbone=dict(
type='BEiTViT',
arch='base',
img_size=224,
patch_size=16,
out_type='avg_featmap',
use_abs_pos_emb=False,
use_rel_pos_bias=True,
use_shared_rel_pos_bias=False,
),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=.02),
dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
],
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
Collections:
- Name: BEiT
Metadata:
Architecture:
- Attention Dropout
- Convolution
- Dense Connections
- Dropout
- GELU
- Layer Normalization
- Multi-Head Attention
- Scaled Dot-Product Attention
- Tanh Activation
Paper:
Title: 'BEiT: BERT Pre-Training of Image Transformers'
URL: https://arxiv.org/abs/2106.08254
README: configs/beit/README.md
Code:
URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/beit.py
Version: v1.0.0rc4
Models:
- Name: beit_beit-base-p16_8xb256-amp-coslr-300e_in1k
Metadata:
Epochs: 300
Batch Size: 2048
FLOPs: 17581219584
Parameters: 86530984
Training Data: ImageNet-1k
In Collection: BEiT
Results: null
Weights: https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth
Config: configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
Downstream:
- beit-base-p16_beit-pre_8xb128-coslr-100e_in1k
- Name: beit-base-p16_beit-pre_8xb128-coslr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 1024
FLOPs: 17581219584
Parameters: 86530984
Training Data: ImageNet-1k
In Collection: BEiT
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 83.1
Weights: https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth
Config: configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
- Name: beit-base-p16_beit-in21k-pre_3rdparty_in1k
Metadata:
FLOPs: 17581219584
Parameters: 86530984
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: BEiT
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 85.28
Top 5 Accuracy: 97.59
Weights: https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth
Config: configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py
Converted From:
Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beit/beit_base_patch16_224_pt22k_ft22kto1k.pth
Code: https://github.com/microsoft/unilm/tree/master/beit
# BEiTv2
> [BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers](https://arxiv.org/abs/2208.06366)
<!-- [ALGORITHM] -->
## Abstract
Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this work, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Furthermore, we introduce a patch aggregation strategy which associates discrete image patches to enhance global semantic representation. Experiments on image classification and semantic segmentation show that BEiT v2 outperforms all compared MIM methods. On ImageNet-1K (224 size), the base-size BEiT v2 achieves 85.5% top-1 accuracy for fine-tuning and 80.1% top-1 accuracy for linear probing. The large-size BEiT v2 obtains 87.3% top-1 accuracy for ImageNet-1K (224 size) fine-tuning, and 56.7% mIoU on ADE20K for semantic segmentation.
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/203912182-5967a520-d455-49ea-bc67-dcbd500d76bf.png" width="70%"/>
</div>
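The vector-quantized tokenizer described above assigns each patch feature to the id of its nearest codebook entry, and those ids become the masked-prediction targets. Below is a minimal sketch of that quantization step (an illustration, not the `VQKD` module used in the config; the cosine-similarity lookup and the 8192-entry codebook size are assumptions drawn from `num_embed` below):

```python
import torch
import torch.nn.functional as F

def quantize_to_tokens(patch_features, codebook):
    """Sketch of vector-quantized tokenization (BEiT v2 style).

    patch_features: (N, num_patches, D) encoder features per patch.
    codebook:       (vocab_size, D) learned code vectors, e.g. 8192 entries.
    Returns (N, num_patches) token ids used as pre-training targets.
    """
    # cosine similarity between normalized features and codebook entries
    feats = F.normalize(patch_features, dim=-1)
    codes = F.normalize(codebook, dim=-1)
    similarity = feats @ codes.T        # (N, num_patches, vocab_size)
    return similarity.argmax(dim=-1)    # nearest code id per patch
```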
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
```
Test:
```shell
python tools/test.py configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :------------------------------------------------ | :--------: | :-------: | :----------------------------------------------------------: | :----------------------------------------------------------------------: |
| `beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 192.81 | 17.58 | [config](beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: |
| `beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k` | [BEITV2](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth) | 86.53 | 17.58 | 85.00 | N/A | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.json) |
| `beit-base-p16_beitv2-in21k-pre_3rdparty_in1k`\* | BEITV2 ImageNet-21k | 86.53 | 17.58 | 86.47 | 97.99 | [config](benchmarks/beit-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth) |
*Models with * are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit2). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@article{beitv2,
title={{BEiT v2}: Masked Image Modeling with Vector-Quantized Visual Tokenizers},
author={Zhiliang Peng and Li Dong and Hangbo Bao and Qixiang Ye and Furu Wei},
year={2022},
eprint={2208.06366},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
_base_ = [
'../_base_/datasets/imagenet_bs256_beitv2.py',
'../_base_/default_runtime.py',
]
# model settings
vqkd_encoder = dict(
arch='base',
img_size=224,
patch_size=16,
in_channels=3,
out_indices=-1,
drop_rate=0.,
drop_path_rate=0.,
norm_cfg=dict(type='LN', eps=1e-6),
final_norm=True,
out_type='featmap',
with_cls_token=True,
frozen_stages=-1,
use_abs_pos_emb=True,
use_rel_pos_bias=False,
use_shared_rel_pos_bias=False,
layer_scale_init_value=0.,
interpolate_mode='bicubic',
patch_cfg=dict(),
layer_cfgs=dict(),
init_cfg=None)
layer_scale_init_value = 0.1
drop_path_rate = 0.1 # 0. for 300 epochs and 0.1 for 1600 epochs.
model = dict(
type='BEiT',
backbone=dict(
type='BEiTPretrainViT',
arch='base',
patch_size=16,
out_indices=[-4, -1],
drop_path_rate=drop_path_rate,
final_norm=False,
out_type='raw',
layer_scale_init_value=layer_scale_init_value,
init_cfg=[
dict(type='TruncNormal', std=0.02, layer='Linear'),
dict(type='TruncNormal', std=0.02, layer='Conv2d'),
dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
]),
neck=dict(
type='BEiTV2Neck',
num_layers=2,
early_layers=9,
backbone_arch='base',
drop_path_rate=drop_path_rate,
layer_scale_init_value=layer_scale_init_value,
),
head=dict(
type='BEiTV2Head',
embed_dims=768,
num_embed=8192,
loss=dict(type='CrossEntropyLoss')),
target_generator=dict(
type='VQKD',
encoder_config=vqkd_encoder,
init_cfg=dict(
type='Pretrained',
checkpoint= # noqa
'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/vqkd_encoder.pth' # noqa
)))
# optimizer wrapper
optim_wrapper = dict(
type='AmpOptimWrapper',
loss_scale='dynamic',
# betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs.
optimizer=dict(
type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05),
clip_grad=dict(max_norm=3.0),
paramwise_cfg=dict(
custom_keys={
# the following configurations are designed for BEiT
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'q_bias': dict(decay_mult=0.0),
'v_bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0),
'.gamma': dict(decay_mult=0.0),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=10,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
eta_min=1e-5,
by_epoch=True,
begin=10,
end=1600,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
default_hooks = dict(
# only keeps the latest 3 checkpoints
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
randomness = dict(seed=0, diff_rank_seed=True)
find_unused_parameters = True
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=2048)
_base_ = [
'../_base_/datasets/imagenet_bs256_beitv2.py',
'../_base_/default_runtime.py',
]
# model settings
vqkd_encoder = dict(
arch='base',
img_size=224,
patch_size=16,
in_channels=3,
out_indices=-1,
drop_rate=0.,
drop_path_rate=0.,
norm_cfg=dict(type='LN', eps=1e-6),
final_norm=True,
out_type='featmap',
with_cls_token=True,
frozen_stages=-1,
use_abs_pos_emb=True,
use_rel_pos_bias=False,
use_shared_rel_pos_bias=False,
layer_scale_init_value=0.,
interpolate_mode='bicubic',
patch_cfg=dict(),
layer_cfgs=dict(),
init_cfg=None)
layer_scale_init_value = 0.1
drop_path_rate = 0. # 0. for 300 epochs and 0.1 for 1600 epochs.
model = dict(
type='BEiT',
backbone=dict(
type='BEiTPretrainViT',
arch='base',
patch_size=16,
out_indices=[-4, -1],
drop_path_rate=drop_path_rate,
final_norm=False,
out_type='raw',
layer_scale_init_value=layer_scale_init_value,
init_cfg=[
dict(type='TruncNormal', std=0.02, layer='Linear'),
dict(type='TruncNormal', std=0.02, layer='Conv2d'),
dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
]),
neck=dict(
type='BEiTV2Neck',
num_layers=2,
early_layers=9,
backbone_arch='base',
drop_path_rate=drop_path_rate,
layer_scale_init_value=layer_scale_init_value,
),
head=dict(
type='BEiTV2Head',
embed_dims=768,
num_embed=8192,
loss=dict(type='CrossEntropyLoss')),
target_generator=dict(
type='VQKD',
encoder_config=vqkd_encoder,
init_cfg=dict(
type='Pretrained',
checkpoint= # noqa
'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/vqkd_encoder.pth' # noqa
)))
# optimizer wrapper
optim_wrapper = dict(
type='AmpOptimWrapper',
loss_scale='dynamic',
# betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs.
optimizer=dict(
type='AdamW', lr=1.5e-3, betas=(0.9, 0.98), weight_decay=0.05),
clip_grad=dict(max_norm=3.0),
paramwise_cfg=dict(
custom_keys={
# the following configurations are designed for BEiT
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'q_bias': dict(decay_mult=0.0),
'v_bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0),
'.gamma': dict(decay_mult=0.0),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=10,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
eta_min=1e-5,
by_epoch=True,
begin=10,
end=300,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
default_hooks = dict(
# only keeps the latest 3 checkpoints
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
randomness = dict(seed=0, diff_rank_seed=True)
find_unused_parameters = True
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=2048)
_base_ = [
'../../_base_/datasets/imagenet_bs64_swin_224.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
# model settings
model = dict(
type='ImageClassifier',
backbone=dict(
type='BEiTViT',
arch='base',
img_size=224,
patch_size=16,
# 0.2 for 1600 epochs pretrained models and 0.1 for 300 epochs.
drop_path_rate=0.1,
out_type='avg_featmap',
use_abs_pos_emb=False,
use_rel_pos_bias=True,
use_shared_rel_pos_bias=False,
init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='RandomResizedCrop',
scale=224,
backend='pillow',
interpolation='bicubic'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandAugment',
policies='timm_increasing',
num_policies=2,
total_level=10,
magnitude_level=9,
magnitude_std=0.5,
hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
dict(
type='RandomErasing',
erase_prob=0.25,
mode='rand',
min_area_ratio=0.02,
max_area_ratio=0.3333333333333333,
fill_color=[103.53, 116.28, 123.675],
fill_std=[57.375, 57.12, 58.395]),
dict(type='PackInputs')
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=256,
edge='short',
backend='pillow',
interpolation='bicubic'),
dict(type='CenterCrop', crop_size=224),
dict(type='PackInputs')
]
train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
# optimizer wrapper
optim_wrapper = dict(
optimizer=dict(
type='AdamW', lr=5e-4, weight_decay=0.05, betas=(0.9, 0.999)),
constructor='LearningRateDecayOptimWrapperConstructor',
paramwise_cfg=dict(
_delete_=True,
# 0.6 for 1600 epochs pretrained models and 0.65 for 300 epochs
layer_decay_rate=0.65,
custom_keys={
# the following configurations are designed for BEiT
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'q_bias': dict(decay_mult=0.0),
'v_bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0),
'.gamma': dict(decay_mult=0.0),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=20,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
by_epoch=True,
begin=20,
end=100,
eta_min=1e-6,
convert_to_iter_based=True)
]
# runtime settings
default_hooks = dict(
# save checkpoint per epoch.
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
train_cfg = dict(by_epoch=True, max_epochs=100)
randomness = dict(seed=0)
_base_ = [
'../../_base_/datasets/imagenet_bs64_swin_224.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
type='BEiTViT',
arch='base',
img_size=224,
patch_size=16,
out_type='avg_featmap',
use_abs_pos_emb=False,
use_rel_pos_bias=True,
use_shared_rel_pos_bias=False,
),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=.02),
dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
],
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
Collections:
- Name: BEiTv2
Metadata:
Architecture:
- Attention Dropout
- Convolution
- Dense Connections
- Dropout
- GELU
- Layer Normalization
- Multi-Head Attention
- Scaled Dot-Product Attention
- Tanh Activation
Paper:
Title: 'BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers'
URL: https://arxiv.org/abs/2208.06366
README: configs/beitv2/README.md
Code:
URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/beit.py
Version: v1.0.0rc4
Models:
- Name: beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k
Metadata:
Epochs: 300
Batch Size: 2048
FLOPs: 17581223424
Parameters: 192811376
Training Data: ImageNet-1k
In Collection: BEiTv2
Results: null
Weights: https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth
Config: configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
Downstream:
- beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k
- Name: beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 1024
FLOPs: 17581219584
Parameters: 86530984
Training Data: ImageNet-1k
In Collection: BEiTv2
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 85.0
Weights: https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth
Config: configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
- Name: beit-base-p16_beitv2-in21k-pre_3rdparty_in1k
Metadata:
FLOPs: 17581219584
Parameters: 86530984
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: BEiTv2
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 86.47
Top 5 Accuracy: 97.99
Weights: https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth
Config: configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py
Converted From:
Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beitv2/beitv2_base_patch16_224_pt1k_ft21kto1k.pth
Code: https://github.com/microsoft/unilm/tree/master/beit2