# MaskFeat
> [Masked Feature Prediction for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2112.09133v1)
<!-- [ALGORITHM] -->
## Abstract
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet.
<div align=center>
<img src="https://user-images.githubusercontent.com/48178838/190090285-428f07c0-0887-4ce8-b94f-f719cfd25622.png" width="60%"/>
</div>
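MaskFeat's regression target is simply the HOG descriptor of each masked patch. The snippet below is a minimal, self-contained sketch of that idea (single-channel input, one magnitude-weighted vote per pixel, per-cell L2 normalization); it is illustrative only and is not mmpretrain's `HOGGenerator`, which additionally handles RGB channels and Gaussian windowing:

```python
# Minimal sketch of a HOG-style target: per-cell gradient-orientation
# histograms with local (per-cell) L2 normalization. Illustrative only;
# not mmpretrain's HOGGenerator.
import torch
import torch.nn.functional as F

def hog_targets(img: torch.Tensor, cell: int = 8, nbins: int = 9):
    """img: (1, 1, H, W); returns (num_cells, nbins) normalized histograms."""
    gx = F.pad(img[..., :, 1:] - img[..., :, :-1], (0, 1))       # horizontal grad
    gy = F.pad(img[..., 1:, :] - img[..., :-1, :], (0, 0, 0, 1))  # vertical grad
    mag = (gx ** 2 + gy ** 2).sqrt()
    ang = torch.atan2(gy, gx) % torch.pi               # orientations in [0, pi)
    bins = (ang / torch.pi * nbins).long().clamp_(max=nbins - 1)
    _, _, h, w = img.shape
    hist = img.new_zeros(nbins, h, w)
    hist.scatter_(0, bins[0], mag[0])                  # one magnitude vote per pixel
    hist = F.avg_pool2d(hist, cell) * cell * cell      # sum votes inside each cell
    return F.normalize(hist.flatten(1).T, dim=-1)      # local contrast normalization

print(hog_targets(torch.rand(1, 1, 224, 224)).shape)   # torch.Size([784, 9])
```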
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py
```
Test:
```shell
python tools/test.py configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.pth
```
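The same runs can also be launched programmatically through MMEngine's `Runner`; a minimal sketch (the `work_dir` value here is an arbitrary example):

```python
# Launch training from Python via MMEngine's Runner instead of tools/train.py.
from mmengine.config import Config
from mmengine.runner import Runner

cfg = Config.fromfile(
    'configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py')
cfg.work_dir = 'work_dirs/maskfeat_pretrain'  # arbitrary output directory
runner = Runner.from_cfg(cfg)
runner.train()
```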
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :------------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------------: | :--------------------------------------------------------------------: |
| `maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k` | 85.88 | 17.58 | [config](maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
| `vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k` | [MASKFEAT](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth) | 86.57 | 17.58 | 83.40 | [config](benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.json) |
## Citation
```bibtex
@InProceedings{wei2022masked,
author = {Wei, Chen and Fan, Haoqi and Xie, Saining and Wu, Chao-Yuan and Yuille, Alan and Feichtenhofer, Christoph},
title = {Masked Feature Prediction for Self-Supervised Visual Pre-Training},
booktitle = {CVPR},
year = {2022},
}
```
_base_ = [
'../../_base_/datasets/imagenet_bs64_swin_224.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
# dataset
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='RandomResizedCrop',
scale=224,
backend='pillow',
interpolation='bicubic'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandAugment',
policies='timm_increasing',
num_policies=2,
total_level=10,
magnitude_level=9,
magnitude_std=0.5,
hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
dict(
type='RandomErasing',
erase_prob=0.25,
mode='rand',
min_area_ratio=0.02,
max_area_ratio=0.3333333333333333,
fill_color=[103.53, 116.28, 123.675],
fill_std=[57.375, 57.12, 58.395]),
dict(type='PackInputs'),
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
dict(type='CenterCrop', crop_size=224),
dict(type='PackInputs'),
]
train_dataloader = dict(batch_size=256, dataset=dict(pipeline=train_pipeline))
val_dataloader = dict(batch_size=256, dataset=dict(pipeline=test_pipeline))
# If you want standard test, please manually configure the test dataset
test_dataloader = val_dataloader
# model settings
model = dict(
type='ImageClassifier',
backbone=dict(
type='VisionTransformer',
arch='base',
img_size=224,
patch_size=16,
drop_path_rate=0.1,
out_type='avg_featmap',
final_norm=False,
init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=2e-5, bias=2e-5)
]),
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
# optimizer wrapper
optim_wrapper = dict(
optimizer=dict(
type='AdamW', lr=8e-3, weight_decay=0.05, betas=(0.9, 0.999)),
constructor='LearningRateDecayOptimWrapperConstructor',
paramwise_cfg=dict(
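# layer-wise lr decay: a layer's lr is scaled by 0.65**(distance from the
# output), so parameters closer to the input are updated more conservatively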
layer_decay_rate=0.65,
custom_keys={
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0)
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=20,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=80,
by_epoch=True,
begin=20,
end=100,
eta_min=1e-6,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
default_hooks = dict(
# save checkpoint per epoch.
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
randomness = dict(seed=0)
_base_ = '../_base_/default_runtime.py'
# dataset settings
dataset_type = 'ImageNet'
data_root = 'data/imagenet/'
data_preprocessor = dict(
type='SelfSupDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='RandomResizedCrop',
scale=224,
crop_ratio_range=(0.5, 1.0),
interpolation='bicubic'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
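# block-wise masking over the 14x14 patch grid: 78 of 196 patches (~40%)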
dict(
type='BEiTMaskGenerator',
input_size=14,
num_masking_patches=78,
min_num_patches=15,
),
dict(type='PackInputs')
]
train_dataloader = dict(
batch_size=256,
num_workers=8,
persistent_workers=True,
pin_memory=True,
sampler=dict(type='DefaultSampler', shuffle=True),
collate_fn=dict(type='default_collate'),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='meta/train.txt',
data_prefix=dict(img_path='train/'),
pipeline=train_pipeline))
# model settings
model = dict(
type='MaskFeat',
backbone=dict(type='MaskFeatViT', arch='b', patch_size=16),
neck=dict(
type='LinearNeck',
in_channels=768,
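# 108 = 3 color channels x (2x2 HOG cells per 16x16 patch) x 9 orientation bins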
out_channels=108,
norm_cfg=None,
init_cfg=dict(type='TruncNormal', layer='Linear', std=0.02, bias=0)),
head=dict(
type='MIMHead',
loss=dict(type='PixelReconstructionLoss', criterion='L2')),
target_generator=dict(
type='HOGGenerator', nbins=9, pool=8, gaussian_window=16))
# optimizer wrapper
optim_wrapper = dict(
type='AmpOptimWrapper',
loss_scale='dynamic',
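# base lr of 2e-4 is defined per 256 samples; 8 GPUs x 256 gives the x8 factor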
optimizer=dict(
type='AdamW', lr=2e-4 * 8, betas=(0.9, 0.999), weight_decay=0.05),
clip_grad=dict(max_norm=0.02),
paramwise_cfg=dict(
bias_decay_mult=0.0,
norm_decay_mult=0.0,
flat_decay_mult=0.0,
custom_keys={
# 'pos_embed': dict(decay_mult=0.),
# 'cls_token': dict(decay_mult=0.),
'mask_token': dict(decay_mult=0.)
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-6,
by_epoch=True,
begin=0,
end=30,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=270,
eta_min=1e-6,
by_epoch=True,
begin=30,
end=300,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
default_hooks = dict(
# only keeps the latest 3 checkpoints
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=2048)
Collections:
- Name: MaskFeat
Metadata:
Training Data: ImageNet-1k
Training Techniques:
- AdamW
Training Resources: 8x A100-80G GPUs
Architecture:
- ViT
Paper:
Title: Masked Feature Prediction for Self-Supervised Visual Pre-Training
URL: https://arxiv.org/abs/2112.09133v1
README: configs/maskfeat/README.md
Models:
- Name: maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k
Metadata:
Epochs: 300
Batch Size: 2048
FLOPs: 17581972224
Parameters: 85882692
Training Data: ImageNet-1k
In Collection: MaskFeat
Results: null
Weights: https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth
Config: configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py
Downstream:
- vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k
- Name: vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 2048
FLOPs: 17581215744
Parameters: 86566120
Training Data: ImageNet-1k
In Collection: MaskFeat
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 83.4
Weights: https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.pth
Config: configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py
# MFF
> [Improving Pixel-based MIM by Reducing Wasted Modeling Capability](https://arxiv.org/abs/2308.00261)
<!-- [ALGORITHM] -->
## Abstract
There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
<div align=center>
<img src="https://user-images.githubusercontent.com/30762564/257412932-5f36b11b-ee64-4ce7-b7d1-a31000302bd8.png" width="80%"/>
</div>
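The core mechanism is a learned, softmax-weighted fusion of shallow-layer token features with the final-layer output before pixel reconstruction. The sketch below illustrates that idea under assumed shapes; the class and its names are illustrative, not mmpretrain's `MFFViT`:

```python
# Illustrative multi-level feature fusion: project each selected layer's
# tokens to a common width, then combine them with softmax-normalized
# learnable weights. Names and shapes are assumptions, not MFFViT itself.
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, num_levels=6, dim=768):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_levels))
        self.weights = nn.Parameter(torch.zeros(num_levels))  # learnable mix

    def forward(self, feats):            # feats: list of (B, N, dim) tensors
        w = self.weights.softmax(dim=0)
        stacked = torch.stack([p(f) for p, f in zip(self.proj, feats)])
        return (w[:, None, None, None] * stacked).sum(dim=0)  # (B, N, dim)

fusion = FeatureFusion()
feats = [torch.rand(2, 49, 768) for _ in range(6)]  # e.g. 6 selected layers
print(fusion(feats).shape)  # torch.Size([2, 49, 768])
```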
## How to use it?
<!-- [TABS-BEGIN] -->
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
```
Test:
```shell
python tools/test.py configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :-------------------------------------------- | :--------: | :-------: | :------------------------------------------------------: | :------------------------------------------------------------------------------: |
| `mff_vit-base-p16_8xb512-amp-coslr-300e_in1k` | - | - | [config](mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.json) |
| `mff_vit-base-p16_8xb512-amp-coslr-800e_in1k` | - | - | [config](mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
| `vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k` | [MFF 300-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth) | 86.57 | 17.58 | 83.00 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.json) |
| `vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k` | [MFF 800-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth) | 86.57 | 17.58 | 83.70 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb128-coslr-100e/vit-base-p16_8xb128-coslr-100e_20230802-6780e47d.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb128-coslr-100e/vit-base-p16_8xb128-coslr-100e_20230802-6780e47d.json) |
| `vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k` | [MFF 300-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth) | 304.33 | 61.60 | 64.20 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb2048-linear-coslr-90e_in1k/vit-base-p16_8xb2048-linear-coslr-90e_in1k.json) |
| `vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k` | [MFF 800-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth) | 304.33 | 61.60 | 68.30 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb2048-linear-coslr-90e/vit-base-p16_8xb2048-linear-coslr-90e_20230802-6b1f7bc8.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb2048-linear-coslr-90e/vit-base-p16_8xb2048-linear-coslr-90e_20230802-6b1f7bc8.json) |
## Citation
```bibtex
@article{MFF,
title={Improving Pixel-based MIM by Reducing Wasted Modeling Capability},
  author={Yuan Liu and Songyang Zhang and Jiacheng Chen and Zhaohui Yu and Kai Chen and Dahua Lin},
journal={arXiv},
year={2023}
}
```
_base_ = [
'../../_base_/datasets/imagenet_bs64_swin_224.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='RandomResizedCrop',
scale=224,
backend='pillow',
interpolation='bicubic'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandAugment',
policies='timm_increasing',
num_policies=2,
total_level=10,
magnitude_level=9,
magnitude_std=0.5,
hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
dict(
type='RandomErasing',
erase_prob=0.25,
mode='rand',
min_area_ratio=0.02,
max_area_ratio=0.3333333333333333,
fill_color=[103.53, 116.28, 123.675],
fill_std=[57.375, 57.12, 58.395]),
dict(type='PackInputs')
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=256,
edge='short',
backend='pillow',
interpolation='bicubic'),
dict(type='CenterCrop', crop_size=224),
dict(type='PackInputs')
]
train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
# model settings
model = dict(
type='ImageClassifier',
backbone=dict(
type='VisionTransformer',
arch='base',
img_size=224,
patch_size=16,
drop_path_rate=0.1,
out_type='avg_featmap',
final_norm=False,
init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]),
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
# optimizer wrapper
optim_wrapper = dict(
optimizer=dict(
type='AdamW', lr=2e-3, weight_decay=0.05, betas=(0.9, 0.999)),
constructor='LearningRateDecayOptimWrapperConstructor',
paramwise_cfg=dict(
layer_decay_rate=0.65,
custom_keys={
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0)
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=5,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=95,
by_epoch=True,
begin=5,
end=100,
eta_min=1e-6,
convert_to_iter_based=True)
]
# runtime settings
default_hooks = dict(
# save checkpoint per epoch.
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
train_cfg = dict(by_epoch=True, max_epochs=100)
randomness = dict(seed=0, diff_rank_seed=True)
_base_ = [
'../../_base_/datasets/imagenet_bs32_pil_resize.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='ToPIL', to_rgb=True),
dict(type='MAERandomResizedCrop', size=224, interpolation=3),
dict(type='torchvision/RandomHorizontalFlip', p=0.5),
dict(type='ToNumpy', to_bgr=True),
dict(type='PackInputs'),
]
# dataset settings
train_dataloader = dict(
batch_size=2048, drop_last=True, dataset=dict(pipeline=train_pipeline))
val_dataloader = dict(drop_last=False)
test_dataloader = dict(drop_last=False)
# model settings
model = dict(
type='ImageClassifier',
backbone=dict(
type='VisionTransformer',
arch='base',
img_size=224,
patch_size=16,
frozen_stages=12,
out_type='cls_token',
final_norm=True,
init_cfg=dict(type='Pretrained', prefix='backbone.')),
neck=dict(type='ClsBatchNormNeck', input_features=768),
head=dict(
type='VisionTransformerClsHead',
num_classes=1000,
in_channels=768,
loss=dict(type='CrossEntropyLoss'),
init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]))
# optimizer
optim_wrapper = dict(
_delete_=True,
type='AmpOptimWrapper',
optimizer=dict(type='LARS', lr=6.4, weight_decay=0.0, momentum=0.9))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=10,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=80,
by_epoch=True,
begin=10,
end=90,
eta_min=0.0,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(by_epoch=True, max_epochs=90)
default_hooks = dict(
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=1),
logger=dict(type='LoggerHook', interval=10))
randomness = dict(seed=0, diff_rank_seed=True)
Collections:
- Name: MFF
Metadata:
Training Data: ImageNet-1k
Training Techniques:
- AdamW
Training Resources: 8x A100-80G GPUs
Architecture:
- ViT
Paper:
Title: Improving Pixel-based MIM by Reducing Wasted Modeling Capability
URL: https://arxiv.org/pdf/2308.00261.pdf
README: configs/mff/README.md
Models:
- Name: mff_vit-base-p16_8xb512-amp-coslr-300e_in1k
Metadata:
Epochs: 300
Batch Size: 2048
FLOPs: 17581972224
Parameters: 85882692
Training Data: ImageNet-1k
In Collection: MFF
Results: null
Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth
Config: configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
Downstream:
- vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k
- vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k
- Name: mff_vit-base-p16_8xb512-amp-coslr-800e_in1k
Metadata:
Epochs: 800
Batch Size: 2048
FLOPs: 17581972224
Parameters: 85882692
Training Data: ImageNet-1k
In Collection: MFF
Results: null
Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth
Config: configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
Downstream:
- vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k
- vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k
- Name: vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 1024
FLOPs: 17581215744
Parameters: 86566120
Training Data: ImageNet-1k
In Collection: MFF
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 83.0
Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.pth
Config: configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
- Name: vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 1024
FLOPs: 17581215744
Parameters: 86566120
Training Data: ImageNet-1k
In Collection: MFF
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 83.7
Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb128-coslr-100e/vit-base-p16_8xb128-coslr-100e_20230802-6780e47d.pth
Config: configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
- Name: vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k
Metadata:
Epochs: 90
Batch Size: 16384
FLOPs: 17581215744
Parameters: 86566120
Training Data: ImageNet-1k
In Collection: MFF
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 64.2
Weights:
Config: configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
- Name: vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k
Metadata:
Epochs: 90
Batch Size: 16384
FLOPs: 17581215744
Parameters: 86566120
Training Data: ImageNet-1k
In Collection: MFF
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 68.3
Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb2048-linear-coslr-90e/vit-base-p16_8xb2048-linear-coslr-90e_20230802-6b1f7bc8.pth
Config: configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
_base_ = '../mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py'
randomness = dict(seed=2, diff_rank_seed=True)
# dataset config
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='ToPIL', to_rgb=True),
dict(type='torchvision/Resize', size=224),
dict(
type='torchvision/RandomCrop',
size=224,
padding=4,
padding_mode='reflect'),
dict(type='torchvision/RandomHorizontalFlip', p=0.5),
dict(type='ToNumpy', to_bgr=True),
dict(type='PackInputs')
]
train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
# model config
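# MFFViT fuses the outputs of shallow layers 0/2/4/6/8 with the final layer 11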
model = dict(
type='MFF', backbone=dict(type='MFFViT', out_indices=[0, 2, 4, 6, 8, 11]))
_base_ = '../mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py'
randomness = dict(seed=2, diff_rank_seed=True)
# dataset config
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='ToPIL', to_rgb=True),
dict(type='torchvision/Resize', size=224),
dict(
type='torchvision/RandomCrop',
size=224,
padding=4,
padding_mode='reflect'),
dict(type='torchvision/RandomHorizontalFlip', p=0.5),
dict(type='ToNumpy', to_bgr=True),
dict(type='PackInputs')
]
train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
# model config
model = dict(
type='MFF', backbone=dict(type='MFFViT', out_indices=[0, 2, 4, 6, 8, 11]))
# MILAN
> [MILAN: Masked Image Pretraining on Language Assisted Representation](https://arxiv.org/pdf/2208.06049)
<!-- [ALGORITHM] -->
## Abstract
Self-attention based transformer models have been dominating many computer vision tasks in the past few years. Their superb model qualities heavily depend on the excessively large labeled image datasets. In order to reduce the reliance on large labeled datasets, reconstruction based masked autoencoders are gaining popularity, which learn high quality transferable representations from unlabeled images. For the same purpose, recent weakly supervised image pretraining methods explore language supervision from text captions accompanying the images. In this work, we propose masked image pretraining on language assisted representation, dubbed as MILAN. Instead of predicting raw pixels or low level features, our pretraining objective is to reconstruct the image features with substantial semantic signals that are obtained using caption supervision. Moreover, to accommodate our reconstruction target, we propose a more efficient prompting decoder architecture and a semantic aware mask sampling mechanism, which further advance the transfer performance of the pretrained model. Experimental results demonstrate that MILAN delivers higher accuracy than the previous works. When the masked autoencoder is pretrained and finetuned on ImageNet-1K dataset with an input resolution of 224×224, MILAN achieves a top-1 accuracy of 85.4% on ViT-B/16, surpassing previous state-of-the-art by 1%. In the downstream semantic segmentation task, MILAN achieves 52.7 mIoU using ViT-B/16 backbone on ADE20K dataset, outperforming previous masked pretraining results by 4 points.
<div align=center>
<img src="https://user-images.githubusercontent.com/30762564/205210369-41a65c4c-bcd4-4147-91ea-c6c9061ab455.png" width="80%"/>
</div>
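A rough sketch of the semantic-aware mask sampling idea: visible patches are drawn with probability proportional to a semantic relevance score (e.g. CLIP's attention from the class token to each patch), so salient regions tend to stay unmasked. This is illustrative, not mmpretrain's `MILANViT`:

```python
# Illustrative semantic-aware mask sampling: sample the *visible* patches
# with probability proportional to a per-patch relevance score, so salient
# regions are more likely to stay unmasked. A sketch of the idea only.
import torch

def semantic_mask(attn_scores, mask_ratio=0.75):
    """attn_scores: (B, N) non-negative relevance per patch.
    Returns a boolean mask of shape (B, N); True = masked."""
    B, N = attn_scores.shape
    num_visible = int(N * (1 - mask_ratio))
    probs = attn_scores / attn_scores.sum(dim=1, keepdim=True)
    visible = torch.multinomial(probs, num_visible)   # (B, num_visible)
    mask = torch.ones(B, N, dtype=torch.bool)
    mask.scatter_(1, visible, False)                  # keep sampled patches
    return mask

mask = semantic_mask(torch.rand(2, 196) + 1e-6)
print(mask.shape, mask.float().mean().item())  # (2, 196), ~0.75 masked
```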
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('vit-base-p16_milan-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('milan_vit-base-p16_16xb256-amp-coslr-400e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py
```
Test:
```shell
python tools/test.py configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :----------------------------------------------- | :--------: | :-------: | :---------------------------------------------------------: | :------------------------------------------------------------------------: |
| `milan_vit-base-p16_16xb256-amp-coslr-400e_in1k` | 111.91 | 17.58 | [config](milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
| `vit-base-p16_milan-pre_8xb128-coslr-100e_in1k` | [MILAN](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) | 86.57 | 17.58 | 85.30 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.json) |
| `vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k` | [MILAN](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) | 86.57 | 17.58 | 78.90 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.json) |
## Citation
```bibtex
@article{Hou2022MILANMI,
title={MILAN: Masked Image Pretraining on Language Assisted Representation},
author={Zejiang Hou and Fei Sun and Yen-Kuang Chen and Yuan Xie and S. Y. Kung},
journal={ArXiv},
year={2022}
}
```
_base_ = [
'../../_base_/datasets/imagenet_bs64_swin_224.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
# dataset settings
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='RandomResizedCrop',
scale=224,
backend='pillow',
interpolation='bicubic'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandAugment',
policies='timm_increasing',
num_policies=2,
total_level=10,
magnitude_level=9,
magnitude_std=0.5,
hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
dict(
type='RandomErasing',
erase_prob=0.25,
mode='rand',
min_area_ratio=0.02,
max_area_ratio=0.3333333333333333,
fill_color=[103.53, 116.28, 123.675],
fill_std=[57.375, 57.12, 58.395]),
dict(type='PackInputs')
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=256,
edge='short',
backend='pillow',
interpolation='bicubic'),
dict(type='CenterCrop', crop_size=224),
dict(type='PackInputs')
]
train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
# model settings
model = dict(
type='ImageClassifier',
backbone=dict(
type='VisionTransformer',
arch='base',
img_size=224,
patch_size=16,
drop_path_rate=0.1,
out_type='avg_featmap',
final_norm=False,
init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
# optimizer wrapper
optim_wrapper = dict(
optimizer=dict(
type='AdamW', lr=4e-4, weight_decay=0.05, betas=(0.9, 0.999)),
constructor='LearningRateDecayOptimWrapperConstructor',
paramwise_cfg=dict(
layer_decay_rate=0.65,
custom_keys={
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0)
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=5,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=95,
by_epoch=True,
begin=5,
end=100,
eta_min=1e-6,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(by_epoch=True, max_epochs=100)
default_hooks = dict(
# save checkpoint per epoch.
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
randomness = dict(seed=0, diff_rank_seed=True)
_base_ = [
'../../_base_/datasets/imagenet_bs32_pil_resize.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
train_dataloader = dict(batch_size=2048, drop_last=True)
val_dataloader = dict(drop_last=False)
test_dataloader = dict(drop_last=False)
# model settings
model = dict(
type='ImageClassifier',
backbone=dict(
type='VisionTransformer',
arch='base',
img_size=224,
patch_size=16,
frozen_stages=12,
out_type='cls_token',
final_norm=True,
init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
neck=dict(type='ClsBatchNormNeck', input_features=768),
head=dict(
type='VisionTransformerClsHead',
num_classes=1000,
in_channels=768,
loss=dict(type='CrossEntropyLoss'),
init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]),
data_preprocessor=dict(
num_classes=1000,
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True,
))
# optimizer
optim_wrapper = dict(
_delete_=True,
type='AmpOptimWrapper',
optimizer=dict(type='LARS', lr=3.2, weight_decay=0.0, momentum=0.9),
)
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=10,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=90,
by_epoch=True,
begin=10,
end=100,
eta_min=0.0,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(by_epoch=True, max_epochs=100)
default_hooks = dict(
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
logger=dict(type='LoggerHook', interval=10))
randomness = dict(seed=0, diff_rank_seed=True)
Collections:
- Name: MILAN
Metadata:
Training Data: ImageNet-1k
Training Techniques:
- AdamW
Training Resources: 16x A100-80G GPUs
Architecture:
- ViT
Paper:
Title: 'MILAN: Masked Image Pretraining on Language Assisted Representation'
URL: https://arxiv.org/pdf/2208.06049
README: configs/milan/README.md
Models:
- Name: milan_vit-base-p16_16xb256-amp-coslr-400e_in1k
Metadata:
Epochs: 400
Batch Size: 4096
FLOPs: 17581972224
Parameters: 111907584
Training Data: ImageNet-1k
In Collection: MILAN
Results: null
Weights: https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth
Config: configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py
Downstream:
- vit-base-p16_milan-pre_8xb128-coslr-100e_in1k
- vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k
- Name: vit-base-p16_milan-pre_8xb128-coslr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 1024
FLOPs: 17581215744
Parameters: 86566120
Training Data: ImageNet-1k
In Collection: MILAN
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 85.3
Weights: https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth
Config: configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
- Name: vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 16384
FLOPs: 17581972992
Parameters: 86567656
Training Data: ImageNet-1k
In Collection: MILAN
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 78.9
Weights: https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.pth
Config: configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
_base_ = [
'../_base_/datasets/imagenet_bs512_mae.py',
'../_base_/default_runtime.py',
]
# dataset settings
train_dataloader = dict(batch_size=256)
# model settings
model = dict(
type='MILAN',
backbone=dict(
type='MILANViT',
arch='b',
patch_size=16,
mask_ratio=0.75,
init_cfg=[
dict(type='Xavier', distribution='uniform', layer='Linear'),
dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
]),
neck=dict(
type='MILANPretrainDecoder',
init_cfg=[
dict(type='Xavier', distribution='uniform', layer='Linear'),
dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
]),
head=dict(
type='MIMHead',
loss=dict(
type='CosineSimilarityLoss', shift_factor=2.0, scale_factor=2.0),
),
target_generator=dict(
type='CLIPGenerator',
tokenizer_path= # noqa
'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa
),
init_cfg=None)
# optimizer wrapper
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(
type='AdamW',
lr=1.5e-4 * 4096 / 256,
betas=(0.9, 0.95),
weight_decay=0.05),
paramwise_cfg=dict(
custom_keys={
'ln': dict(decay_mult=0.0),
'bias': dict(decay_mult=0.0),
'pos_embed': dict(decay_mult=0.),
'mask_token': dict(decay_mult=0.),
'cls_token': dict(decay_mult=0.)
}))
find_unused_parameters = True
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=40,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=360,
by_epoch=True,
begin=40,
end=400,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
default_hooks = dict(
# only keeps the latest 3 checkpoints
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
randomness = dict(seed=0, diff_rank_seed=True)
# auto resume
resume = True
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=2048)
# MiniGPT4
> [MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models](https://arxiv.org/abs/2304.10592)
<!-- [ALGORITHM] -->
## Abstract
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4 like detailed image description generation and website creation from hand-written drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, teaching users how to cook based on food photos, etc. In our experiment, we found that only performing the pretraining on raw image-text pairs could produce unnatural language outputs that lack coherency including repetition and fragmented sentences. To address this problem, we curate a high-quality, well-aligned dataset in the second stage to finetune our model using a conversational template. This step proved crucial for augmenting the model's generation reliability and overall usability. Notably, our model is highly computationally efficient, as we only train a projection layer utilizing approximately 5 million aligned image-text pairs. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/.
<div align=center>
<img src="https://github.com/open-mmlab/mmpretrain/assets/36138628/1d8f328d-6c91-493e-8992-29e84a0fc3c8" width="80%"/>
</div>
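Architecturally, the only trainable piece is a single linear projection that maps the frozen Q-Former's query outputs into the frozen LLM's embedding space. Below is a toy sketch of that step; the 768 and 4096 widths are assumptions for a BERT-style Q-Former and a 7B LLaMA-family LLM, not values taken from the configs:

```python
# Illustrative version of MiniGPT-4's single trainable projection layer:
# frozen Q-Former query outputs are mapped into the frozen LLM's embedding
# space and spliced into the prompt as soft image tokens. Widths assumed.
import torch
import torch.nn as nn

qformer_dim, llm_dim = 768, 4096
proj = nn.Linear(qformer_dim, llm_dim)           # the only trainable module

query_output = torch.rand(1, 32, qformer_dim)    # 32 query tokens per image
img_embeds = proj(query_output)                  # (1, 32, llm_dim)
print(img_embeds.shape)  # ready to splice into the LLM's input embeddings
```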
## How to use it?
<!-- [TABS-BEGIN] -->
**Use the model**
```python
from mmpretrain import inference_model
result = inference_model('minigpt-4_vicuna-7b_caption', 'demo/cat-dog.png')
print(result)
# {'pred_caption': 'This image shows a small dog and a kitten sitting on a blanket in a field of flowers. The dog is looking up at the kitten with a playful expression on its face. The background is a colorful striped blanket, and there are flowers all around them. The image is well composed with the two animals sitting in the center of the frame, surrounded by the flowers and blanket.'}
```
<!-- [TABS-END] -->
## Models and results
For the Vicuna model, please refer to the [MiniGPT-4 page](https://github.com/Vision-CAIR/MiniGPT-4) for preparation guidelines.
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :------------------------------ | :--------: | :-------: | :----------------------------------------: | :----------------------------------------------------------------------------------------------------------: |
| `minigpt-4_baichuan-7b_caption` | 8094.77 | N/A | [config](minigpt-4_baichuan-7b_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_baichuan7b_20231011-5dca7ed6.pth) |
| `minigpt-4_vicuna-7b_caption`\* | 8121.32 | N/A | [config](minigpt-4_vicuna-7b_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_vicuna7b_20230615-714b5f52.pth) |
*Models with * are converted from the [official repo](https://github.com/Vision-CAIR/MiniGPT-4/tree/main). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@article{zhu2023minigpt,
title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models},
author={Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed},
journal={arXiv preprint arXiv:2304.10592},
year={2023}
}
```
Collections:
- Name: MiniGPT4
Metadata:
Architecture:
- Transformer
- Gated Cross-Attention Dense
Paper:
Title: 'MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models'
URL: https://arxiv.org/abs/2304.10592
README: configs/minigpt4/README.md
Models:
- Name: minigpt-4_vicuna-7b_caption
Metadata:
FLOPs: null
Parameters: 8121315072
In Collection: MiniGPT4
Results:
- Task: Image Caption
Dataset: COCO
Metrics: null
Weights: https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_vicuna7b_20230615-714b5f52.pth
Config: configs/minigpt4/minigpt-4_vicuna-7b_caption.py
Converted From:
Weights: https://github.com/Vision-CAIR/MiniGPT-4/tree/main
Code: https://github.com/Vision-CAIR/MiniGPT-4/tree/main
- Name: minigpt-4_baichuan-7b_caption
Metadata:
FLOPs: null
Parameters: 8094769024
In Collection: MiniGPT4
Results:
- Task: Image Caption
Dataset: COCO
Metrics: null
Weights: https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_baichuan7b_20231011-5dca7ed6.pth
Config: configs/minigpt4/minigpt-4_baichuan-7b_caption.py
_base_ = [
'../_base_/default_runtime.py',
]
data_preprocessor = dict(
type='MultiModalDataPreprocessor',
mean=[122.770938, 116.7460125, 104.09373615],
std=[68.5005327, 66.6321579, 70.32316305],
to_rgb=True,
)
# dataset settings
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='Resize',
scale=(224, 224),
interpolation='bicubic',
backend='pillow'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='CleanCaption',
keys='chat_content',
remove_chars='',
lowercase=False),
dict(
type='PackInputs',
algorithm_keys=['chat_content', 'lang'],
meta_keys=['image_id']),
]
train_dataloader = dict(
batch_size=2,
num_workers=4,
dataset=dict(
type='MiniGPT4Dataset',
data_root='YOUR_DATA_DIRECTORY',
ann_file='YOUR_DATA_FILE',
pipeline=train_pipeline),
sampler=dict(type='DefaultSampler', shuffle=True),
collate_fn=dict(type='default_collate'),
drop_last=False,
)
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='Resize',
scale=(224, 224),
interpolation='bicubic',
backend='pillow'),
dict(type='PackInputs', meta_keys=['image_id']),
]
test_evaluator = dict(
type='COCOCaption',
ann_file='data/coco/annotations/coco_karpathy_val_gt.json',
)
test_dataloader = dict(
batch_size=1,
dataset=dict(
type='COCOCaption',
data_root='data/coco',
ann_file='annotations/coco_karpathy_val.json',
pipeline=test_pipeline))
# model settings
model = dict(
type='MiniGPT4',
vision_encoder=dict(
type='BEiTViT',
# eva-g without the final layer
arch=dict(
embed_dims=1408,
num_layers=39,
num_heads=16,
feedforward_channels=6144,
),
img_size=224,
patch_size=14,
layer_scale_init_value=0.0,
frozen_stages=39,
use_abs_pos_emb=True,
use_rel_pos_bias=False,
final_norm=False,
use_shared_rel_pos_bias=False,
out_type='raw',
pretrained= # noqa
'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_eva-g-p14_20230615-e908c021.pth' # noqa
),
q_former_model=dict(
type='Qformer',
model_style='bert-base-uncased',
vision_model_width=1408,
add_cross_attention=True,
cross_attention_freq=2,
num_query_token=32,
pretrained= # noqa
'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_qformer_20230615-1dfa889c.pth' # noqa
),
lang_encoder=dict(
type='AutoModelForCausalLM',
name_or_path='baichuan-inc/baichuan-7B',
trust_remote_code=True),
tokenizer=dict(
type='AutoTokenizer',
name_or_path='baichuan-inc/baichuan-7B',
trust_remote_code=True),
task='caption',
prompt_template=dict([('en', '###Ask: {} ###Answer: '),
('zh', '###问:{} ###答:')]),
raw_prompts=dict([
('en', [('<Img><ImageHere></Img> '
'Describe this image in detail.'),
('<Img><ImageHere></Img> '
'Take a look at this image and describe what you notice.'),
('<Img><ImageHere></Img> '
'Please provide a detailed description of the picture.'),
('<Img><ImageHere></Img> '
'Could you describe the contents of this image for me?')]),
('zh', [('<Img><ImageHere></Img> '
'详细描述这张图片。'), ('<Img><ImageHere></Img> '
'浏览这张图片并描述你注意到什么。'),
('<Img><ImageHere></Img> '
'请对这张图片进行详细的描述。'),
('<Img><ImageHere></Img> '
'你能为我描述这张图片的内容吗?')])
]),
max_txt_len=160,
end_sym='###')
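# training strategy: DeepSpeed ZeRO stage 2 with dynamic fp16 loss scaling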
strategy = dict(
type='DeepSpeedStrategy',
fp16=dict(
enabled=True,
auto_cast=False,
fp16_master_weights_and_grads=False,
loss_scale=0,
loss_scale_window=1000,
hysteresis=1,
min_loss_scale=1,
initial_scale_power=16,
),
inputs_to_half=[0],
zero_optimization=dict(
stage=2,
allgather_partitions=True,
allgather_bucket_size=2e8,
reduce_scatter=True,
reduce_bucket_size='auto',
overlap_comm=True,
contiguous_gradients=True,
),
)
# schedule settings
optim_wrapper = dict(
type='DeepSpeedOptimWrapper',
optimizer=dict(type='AdamW', lr=1e-3, weight_decay=0.05))
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-3 / 500,
by_epoch=False,
begin=0,
end=500,
),
dict(
type='CosineAnnealingLR',
eta_min=2e-4,
by_epoch=False,
begin=500,
),
]
train_cfg = dict(by_epoch=True, max_epochs=6)
test_cfg = dict()
runner_type = 'FlexibleRunner'
default_hooks = dict(
checkpoint=dict(
type='CheckpointHook',
interval=1,
by_epoch=True,
save_last=True,
max_keep_ckpts=1,
))
_base_ = [
'../_base_/datasets/coco_caption.py',
'../_base_/default_runtime.py',
]
# dataset settings
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='Resize',
scale=(224, 224),
interpolation='bicubic',
backend='pillow'),
dict(type='PackInputs', meta_keys=['image_id']),
]
val_dataloader = dict(batch_size=1, dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
# model settings
model = dict(
type='MiniGPT4',
vision_encoder=dict(
type='BEiTViT',
# eva-g without the final layer
arch=dict(
embed_dims=1408,
num_layers=39,
num_heads=16,
feedforward_channels=6144,
),
img_size=224,
patch_size=14,
layer_scale_init_value=0.0,
frozen_stages=39,
use_abs_pos_emb=True,
use_rel_pos_bias=False,
final_norm=False,
use_shared_rel_pos_bias=False,
out_type='raw',
pretrained= # noqa
'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_eva-g-p14_20230615-e908c021.pth' # noqa
),
q_former_model=dict(
type='Qformer',
model_style='bert-base-uncased',
vision_model_width=1408,
add_cross_attention=True,
cross_attention_freq=2,
num_query_token=32,
pretrained= # noqa
'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_qformer_20230615-1dfa889c.pth' # noqa
),
lang_encoder=dict(
type='AutoModelForCausalLM', name_or_path='YOUR_PATH_TO_VICUNA'),
tokenizer=dict(type='LlamaTokenizer', name_or_path='YOUR_PATH_TO_VICUNA'),
task='caption',
prompt_template=dict([('en', '###Ask: {} ###Answer: '),
('zh', '###问:{} ###答:')]),
raw_prompts=dict([
('en', [('<Img><ImageHere></Img> '
'Describe this image in detail.'),
('<Img><ImageHere></Img> '
'Take a look at this image and describe what you notice.'),
('<Img><ImageHere></Img> '
'Please provide a detailed description of the picture.'),
('<Img><ImageHere></Img> '
'Could you describe the contents of this image for me?')]),
('zh', [('<Img><ImageHere></Img> '
'详细描述这张图片。'), ('<Img><ImageHere></Img> '
'浏览这张图片并描述你注意到什么。'),
('<Img><ImageHere></Img> '
'请对这张图片进行详细的描述。'),
('<Img><ImageHere></Img> '
'你能为我描述这张图片的内容吗?')])
]),
max_txt_len=160,
end_sym='###')
# schedule settings
optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
param_scheduler = [
dict(
type='CosineAnnealingLR',
by_epoch=True,
begin=0,
end=5,
)
]
train_cfg = dict(by_epoch=True, max_epochs=5)
val_cfg = dict()
test_cfg = dict()
# MixMIM
> [MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning](https://arxiv.org/abs/2205.13137)
<!-- [ALGORITHM] -->
## Abstract
In this study, we propose Mixed and Masked Image Modeling (MixMIM), a simple but efficient MIM method that is applicable to various hierarchical Vision Transformers. Existing MIM methods replace a random subset of input tokens with a special [MASK] symbol and aim at reconstructing original image tokens from the corrupted image. However, we find that using the [MASK] symbol greatly slows down the training and causes training-finetuning inconsistency, due to the large masking ratio (e.g., 40% in BEiT). In contrast, we replace the masked tokens of one image with visible tokens of another image, i.e., creating a mixed image. We then conduct dual reconstruction to reconstruct the original two images from the mixed input, which significantly improves efficiency. While MixMIM can be applied to various architectures, this paper explores a simpler but stronger hierarchical Transformer, and scales with MixMIM-B, -L, and -H. Empirical results demonstrate that MixMIM can learn high-quality visual representations efficiently. Notably, MixMIM-B with 88M parameters achieves 85.1% top-1 accuracy on ImageNet-1K by pretraining for 600 epochs, setting a new record for neural networks with comparable model sizes (e.g., ViT-B) among MIM methods. Besides, its transferring performances on the other 6 datasets show MixMIM has a better FLOPs / performance tradeoff than previous MIM methods.
<div align=center>
<img src="https://user-images.githubusercontent.com/56866854/202853730-d26fb3d7-e5e8-487a-aad5-e3d4600cef87.png"/>
</div>
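The mixing step itself is a token-level swap controlled by one binary mask: positions masked in image A are filled with image B's tokens, and the decoder is then trained to reconstruct both images from the single mixed sequence. A minimal illustrative sketch (not mmpretrain's MixMIM implementation):

```python
# Illustrative MixMIM-style mixing: swap the masked tokens of one image with
# the visible tokens of another, then reconstruct both from the mixed input.
import torch

def mix_tokens(tokens_a, tokens_b, mask):
    """tokens_*: (B, N, C); mask: (B, N) bool, True = masked in image A.
    Masked positions of A are filled with B's tokens (and vice versa)."""
    m = mask.unsqueeze(-1)
    return torch.where(m, tokens_b, tokens_a)

B, N, C = 2, 196, 768
a, b = torch.rand(B, N, C), torch.rand(B, N, C)
mask = torch.rand(B, N) < 0.5                 # ~50% masking ratio
mixed = mix_tokens(a, b, mask)
# dual reconstruction: the decoder output is supervised by image A on the
# positions taken from A's tokens, and by image B on the swapped positions
print(mixed.shape)  # torch.Size([2, 196, 768])
```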
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('mixmim_mixmim-base_16xb128-coslr-300e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py
```
Test:
```shell
python tools/test.py configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------: | :--------------------------------------------------------------------------------: |
| `mixmim_mixmim-base_16xb128-coslr-300e_in1k` | 114.67 | 16.35 | [config](mixmim_mixmim-base_16xb128-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
| `mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k` | [MIXMIM](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.pth) | 88.34 | 16.35 | 84.63 | [config](benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.json) |
## Citation
```bibtex
@article{MixMIM2022,
  author = {Jihao Liu and Xin Huang and Yu Liu and Hongsheng Li},
journal = {arXiv:2205.13137},
title = {MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning},
year = {2022},
}
```