Collections:
- Name: EVA
Metadata:
Architecture:
- Attention Dropout
- Convolution
- Dense Connections
- Dropout
- GELU
- Layer Normalization
- Multi-Head Attention
- Scaled Dot-Product Attention
- Tanh Activation
Paper:
Title: 'EVA: Exploring the Limits of Masked Visual Representation Learning at
Scale'
URL: https://arxiv.org/abs/2211.07636
README: configs/eva/README.md
Code:
URL: null
Version: null
Models:
- Name: eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k
Metadata:
Epochs: 400
Batch Size: 4096
FLOPs: 17581972224
Parameters: 111776512
Training Data: ImageNet-1k
In Collection: EVA
Results: null
Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth
Config: configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py
Downstream:
- vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k
- vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k
- Name: vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 1024
FLOPs: 17581215744
Parameters: 86566120
Training Data: ImageNet-1k
In Collection: EVA
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 83.7
Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.pth
Config: configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
- Name: vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 16384
FLOPs: 17581972992
Parameters: 86567656
Training Data: ImageNet-1k
In Collection: EVA
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 69.0
Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221226-ef51bf09.pth
Config: configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
- Name: beit-l-p14_eva-pre_3rdparty_in1k-196px
Metadata:
FLOPs: 61565981696
Parameters: 304142312
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 87.94
Top 5 Accuracy: 98.5
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-196px_20221214-2adf4d28.pth
Config: configs/eva/eva-l-p14_8xb16_in1k-196px.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_196px_1k_ft_88p0.pt
Code: https://github.com/baaivision/EVA
- Name: beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px
Metadata:
FLOPs: 61565981696
Parameters: 304142312
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 88.58
Top 5 Accuracy: 98.65
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-196px_20221213-b730c7e7.pth
Config: configs/eva/eva-l-p14_8xb16_in1k-196px.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_196px_21k_to_1k_ft_88p6.pt
Code: https://github.com/baaivision/EVA
- Name: beit-l-p14_3rdparty-eva_in21k
Metadata:
FLOPs: 81075147776
Parameters: 303178752
Training Data:
- ImageNet-21k
In Collection: EVA
Results: null
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth
Config: configs/eva/eva-l-p14_headless.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14.pt
Code: https://github.com/baaivision/EVA
Downstream:
- beit-l-p14_eva-pre_3rdparty_in21k
- beit-l-p14_eva-pre_3rdparty_in1k-336px
- beit-l-p14_eva-pre_3rdparty_in1k-196px
- Name: beit-l-p14_eva-pre_3rdparty_in21k
Metadata:
FLOPs: 81075147776
Parameters: 303178752
Training Data:
- ImageNet-21k
In Collection: EVA
Results: null
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in21k_20221213-8f194fa2.pth
Config: configs/eva/eva-l-p14_headless.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_21k_ft.pt
Code: https://github.com/baaivision/EVA
- Name: beit-l-p14_eva-pre_3rdparty_in1k-336px
Metadata:
FLOPs: 191100916736
Parameters: 304531432
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 88.66
Top 5 Accuracy: 98.75
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-336px_20221214-07785cfd.pth
Config: configs/eva/eva-l-p14_8xb16_in1k-336px.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_336px_1k_ft_88p65.pt
Code: https://github.com/baaivision/EVA
Downstream:
- beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px
- beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px
- Name: beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px
Metadata:
FLOPs: 191100916736
Parameters: 304531432
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 89.17
Top 5 Accuracy: 98.86
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-336px_20221213-f25b7634.pth
Config: configs/eva/eva-l-p14_8xb16_in1k-336px.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_336px_21k_to_1k_ft_89p2.pt
Code: https://github.com/baaivision/EVA
- Name: beit-g-p16_3rdparty-eva_30m
Metadata:
FLOPs: 203517463424
Parameters: 1011315072
Training Data:
- merged-30M
In Collection: EVA
Results: null
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p16_3rdparty_30m_20221213-7bed23ee.pth
Config: configs/eva/eva-g-p16_headless.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_psz14to16.pt
Code: https://github.com/baaivision/EVA
- Name: beit-g-p14_3rdparty-eva_30m
Metadata:
FLOPs: 267174833024
Parameters: 1011596672
Training Data:
- merged-30M
In Collection: EVA
Results: null
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_3rdparty_30m_20221213-3b7aca97.pth
Config: configs/eva/eva-g-p14_headless.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_psz14.pt
Code: https://github.com/baaivision/EVA
Downstream:
- beit-g-p14_eva-30m-pre_3rdparty_in21k
- Name: beit-g-p14_eva-30m-pre_3rdparty_in21k
Metadata:
FLOPs: 267174833024
Parameters: 1011596672
Training Data:
- merged-30M
- ImageNet-21k
In Collection: EVA
Results: null
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth
Config: configs/eva/eva-g-p14_headless.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_224px_psz14.pt
Code: https://github.com/baaivision/EVA
Downstream:
- beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px
- beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px
- Name: beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px
Metadata:
FLOPs: 620642757504
Parameters: 1013005672
Training Data:
- merged-30M
- ImageNet-21k
- ImageNet-1k
In Collection: EVA
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 89.61
Top 5 Accuracy: 98.93
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-336px_20221213-210f9071.pth
Config: configs/eva/eva-g-p14_8xb16_in1k-336px.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_336px_psz14_ema_89p6.pt
Code: https://github.com/baaivision/EVA
- Name: beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px
Metadata:
FLOPs: 1906761591680
Parameters: 1014447464
Training Data:
- merged-30M
- ImageNet-21k
- ImageNet-1k
In Collection: EVA
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 89.71
Top 5 Accuracy: 98.96
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-560px_20221213-fa1c3652.pth
Config: configs/eva/eva-g-p14_8xb16_in1k-560px.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_560px_psz14_ema_89p7.pt
Code: https://github.com/baaivision/EVA
# EVA-02
> [EVA-02: A Visual Representation for Neon Genesis](https://arxiv.org/abs/2303.11331)
<!-- [ALGORITHM] -->
## Abstract
We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 parameters and ~1/6 image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community.
<div align=center>
<img src="https://user-images.githubusercontent.com/40905160/229037980-b83dceb5-41d6-406c-a20b-63b83c80136d.png" width="70%" alt="TrV builds upon the original plain ViT architecture and includes several enhancements: SwinGLU FFN, sub-LN, 2D RoPE, and JAX weight initialization. To keep the parameter & FLOPs consistent with the baseline, the FFN hidden dim of SwiGLU is 2/3× of the typical MLP counterpart."/>
</div>
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
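For batch prediction over many images, the task-specific inferencer can be constructed once and reused. This is a minimal sketch: the model name is the same one used above, and the second image path is a placeholder.
```python
from mmpretrain import ImageClassificationInferencer

# Build the inferencer once, then feed it a list of image paths.
inferencer = ImageClassificationInferencer(
    'vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px')
results = inferencer(['demo/bird.JPEG', 'path/to/another.JPEG'], batch_size=2)
print(results[0]['pred_class'], results[0]['pred_score'])
```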
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', pretrained=True)
inputs = torch.rand(1, 3, 336, 336)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/eva02/eva02-tiny-p14_in1k.py
```
Test:
```shell
python tools/test.py configs/eva02/eva02-tiny-p14_in1k.py /path/to/eva02-tiny-p14_in1k.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :-------------------------------- | :--------: | :-------: | :-----------------------------------: | :-----------------------------------------------------------------------------------------------------------: |
| `vit-tiny-p14_eva02-pre_in21k`\* | 5.50 | 1.70 | [config](eva02-tiny-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_pre_in21k_20230505-d703e7b1.pth) |
| `vit-small-p14_eva02-pre_in21k`\* | 21.62 | 6.14 | [config](eva02-small-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_pre_in21k_20230505-3175f463.pth) |
| `vit-base-p14_eva02-pre_in21k`\* | 85.77 | 23.22 | [config](eva02-base-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_pre_in21k_20230505-2f2d4d3c.pth) |
| `vit-large-p14_eva02-pre_in21k`\* | 303.29 | 81.15 | [config](eva02-large-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_in21k_20230505-9072de5d.pth) |
| `vit-large-p14_eva02-pre_m38m`\* | 303.29 | 81.15 | [config](eva02-large-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_m38m_20230505-b8a1a261.pth) |
- The input size / patch size of MIM pre-trained EVA-02 is `224x224` / `14x14`.
*Models with * are converted from the [official repo](https://github.com/baaivision/EVA).*
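The pre-trained checkpoints above are backbone-only (their configs set `neck=None` and `head=None`), so they are intended for feature extraction or downstream fine-tuning rather than direct classification. A minimal sketch, assuming the weights download successfully:
```python
import torch
from mmpretrain import get_model

# MIM pre-trained EVA-02 base backbone; expects 224x224 inputs with a 14x14 patch size.
model = get_model('vit-base-p14_eva02-pre_in21k', pretrained=True)
feats = model.extract_feat(torch.rand(1, 3, 224, 224))
print(type(feats), [f.shape for f in feats])
```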
### Image Classification on ImageNet-1k
#### (*w/o* IN-21K intermediate fine-tuning)
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------: |
| `vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k | 5.76 | 4.68 | 80.69 | 95.54 | [config](./eva02-tiny-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_in21k-pre_3rdparty_in1k-336px_20230505-a4e8708a.pth) |
| `vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k | 22.13 | 15.48 | 85.78 | 97.60 | [config](./eva02-small-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_in21k-pre_3rdparty_in1k-336px_20230505-9c5b0e85.pth) |
| `vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.29 | 98.53 | [config](./eva02-base-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_3rdparty_in1k-448px_20230505-8ad211c5.pth) |
*Models with * are converted from the [official repo](https://github.com/baaivision/EVA/tree/master/EVA-02). The config files of these models are only for inference. We haven't reproduced the training results.*
#### (*w* IN-21K intermediate fine-tuning)
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------: |
| `vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.47 | 98.62 | [config](./eva02-base-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-5cd4d87f.pth) |
| `vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 305.08 | 362.33 | 89.65 | 98.95 | [config](./eva02-large-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-926d1599.pth) |
| `vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 Merged-38M | 305.10 | 362.33 | 89.83 | 99.00 | [config](./eva02-large-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_m38m-pre_in21k-medft_3rdparty_in1k-448px_20230505-150dc5ed.pth) |
*Models with * are converted from the [official repo](https://github.com/baaivision/EVA/tree/master/EVA-02). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@article{EVA-02,
title={EVA-02: A Visual Representation for Neon Genesis},
author={Yuxin Fang and Quan Sun and Xinggang Wang and Tiejun Huang and Xinlong Wang and Yue Cao},
journal={arXiv preprint arXiv:2303.11331},
year={2023}
}
```
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='b',
img_size=224,
patch_size=14,
sub_ln=True,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=None,
)
data_preprocessor = dict(
# RGB format normalization parameters
mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
# convert image from BGR to RGB
to_rgb=True,
)
_base_ = [
'../_base_/datasets/imagenet_bs16_eva_448.py',
'../_base_/schedules/imagenet_bs2048_AdamW.py',
'../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='b',
img_size=448,
patch_size=14,
sub_ln=True,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=.02),
dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
],
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='l',
img_size=224,
patch_size=14,
sub_ln=True,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=None,
)
data_preprocessor = dict(
# RGB format normalization parameters
mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
# convert image from BGR to RGB
to_rgb=True,
)
_base_ = [
'../_base_/datasets/imagenet_bs16_eva_448.py',
'../_base_/schedules/imagenet_bs2048_AdamW.py',
'../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='l',
img_size=448,
patch_size=14,
sub_ln=True,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=1024,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=.02),
dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
],
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='s',
img_size=224,
patch_size=14,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=None,
)
data_preprocessor = dict(
# RGB format normalization parameters
mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
# convert image from BGR to RGB
to_rgb=True,
)
_base_ = [
'../_base_/datasets/imagenet_bs16_eva_336.py',
'../_base_/schedules/imagenet_bs2048_AdamW.py',
'../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='s',
img_size=336,
patch_size=14,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=384,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=.02),
dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
],
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='t',
img_size=224,
patch_size=14,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=None,
)
data_preprocessor = dict(
# RGB format normalization parameters
mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
# convert image from BGR to RGB
to_rgb=True,
)
_base_ = [
'../_base_/datasets/imagenet_bs16_eva_336.py',
'../_base_/schedules/imagenet_bs2048_AdamW.py',
'../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='t',
img_size=336,
patch_size=14,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=192,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=.02),
dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
],
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
Collections:
- Name: EVA02
Metadata:
Architecture:
- Rotary Position Embedding
- Sub Layer Normalization
- SwiGLU
Paper:
Title: 'EVA-02: A Visual Representation for Neon Genesis'
URL: https://arxiv.org/abs/2303.11331
README: configs/eva02/README.md
Models:
- Name: vit-tiny-p14_eva02-pre_in21k
Metadata:
FLOPs: 1703439360
Parameters: 5504064
Training Data:
- ImageNet-21k
In Collection: EVA02
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_pre_in21k_20230505-d703e7b1.pth
Config: configs/eva02/eva02-tiny-p14_headless.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_Ti_pt_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
Downstream:
- vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px
- Name: vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px
Metadata:
FLOPs: 4675416000
Parameters: 5758888
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA02
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 80.69
Top 5 Accuracy: 95.54
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_in21k-pre_3rdparty_in1k-336px_20230505-a4e8708a.pth
Config: configs/eva02/eva02-tiny-p14_in1k.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_Ti_pt_in21k_ft_in1k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
- Name: vit-small-p14_eva02-pre_in21k
Metadata:
FLOPs: 6135404544
Parameters: 21624960
Training Data:
- ImageNet-21k
In Collection: EVA02
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_pre_in21k_20230505-3175f463.pth
Config: configs/eva02/eva02-small-p14_headless.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_S_pt_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
Downstream:
- vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px
- Name: vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px
Metadata:
FLOPs: 15476744064
Parameters: 22133608
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA02
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 85.78
Top 5 Accuracy: 97.60
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_in21k-pre_3rdparty_in1k-336px_20230505-9c5b0e85.pth
Config: configs/eva02/eva02-small-p14_in1k.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_S_pt_in21k_ft_in1k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
- Name: vit-base-p14_eva02-pre_in21k
Metadata:
FLOPs: 23216492544
Parameters: 85766400
Training Data:
- ImageNet-21k
In Collection: EVA02
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_pre_in21k_20230505-2f2d4d3c.pth
Config: configs/eva02/eva02-base-p14_headless.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_B_pt_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
Downstream:
- vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px
- vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
- Name: vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px
Metadata:
FLOPs: 107105984256
Parameters: 87126760
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA02
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 88.29
Top 5 Accuracy: 98.53
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_3rdparty_in1k-448px_20230505-8ad211c5.pth
Config: configs/eva02/eva02-base-p14_in1k.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_B_pt_in21k_ft_in1k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
- Name: vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
Metadata:
FLOPs: 107105984256
Parameters: 87126760
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA02
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 88.47
Top 5 Accuracy: 98.62
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-5cd4d87f.pth
Config: configs/eva02/eva02-base-p14_in1k.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_B_pt_in21k_medft_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
- Name: vit-large-p14_eva02-pre_in21k
Metadata:
FLOPs: 81146703792
Parameters: 303291328
Training Data:
- ImageNet-21k
In Collection: EVA02
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_in21k_20230505-9072de5d.pth
Config: configs/eva02/eva02-large-p14_headless.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_L_pt_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
Downstream:
- vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
- Name: vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
Metadata:
FLOPs: 362333836208
Parameters: 305104808
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA02
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 89.65
Top 5 Accuracy: 98.95
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-926d1599.pth
Config: configs/eva02/eva02-large-p14_in1k.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_L_pt_in21k_medft_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
- Name: vit-large-p14_eva02-pre_m38m
Metadata:
FLOPs: 81146703792
Parameters: 303291328
Training Data:
- Merged-38M
In Collection: EVA02
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_m38m_20230505-b8a1a261.pth
Config: configs/eva02/eva02-large-p14_headless.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_L_pt_m38m_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
Downstream:
- vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px
- Name: vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px
Metadata:
FLOPs: 362333836208
Parameters: 305104808
Training Data:
- Merged-38M
- ImageNet-21k
- ImageNet-1k
In Collection: EVA02
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 89.83
Top 5 Accuracy: 99.00
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_m38m-pre_in21k-medft_3rdparty_in1k-448px_20230505-150dc5ed.pth
Config: configs/eva02/eva02-large-p14_in1k.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_L_pt_m38m_medft_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
# Flamingo
> [Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198)
<!-- [ALGORITHM] -->
## Abstract
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/236371424-3b9d2e16-3966-4c64-8b87-e33fd6348824.png" width="80%"/>
</div>
## How to use it?
<!-- [TABS-BEGIN] -->
**Use the model**
```python
from mmpretrain import inference_model
result = inference_model('flamingo_3rdparty-zeroshot_caption', 'demo/cat-dog.png')
print(result)
# {'pred_caption': 'A dog and a cat are looking at each other. '}
```
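The same weights also back the zero-shot VQA entry below. A minimal sketch, assuming `inference_model` forwards the question as the second positional argument to the VQA inferencer:
```python
from mmpretrain import inference_model

# Model name taken from the metafile; the question string is a free-form prompt.
result = inference_model('flamingo_3rdparty-zeroshot_vqa',
                         'demo/cat-dog.png',
                         'What animals are in the picture?')
print(result)
```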
**Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Test:
```shell
python tools/test.py configs/flamingo/flamingo_zeroshot_caption.py https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Caption on COCO
| Model | Params (G) | CIDEr | Config | Download |
| :------------------------------------- | :--------: | :---: | :------------------------------------: | :-----------------------------------------------------------------------------------------------------------: |
| `flamingo_3rdparty-zeroshot_caption`\* | 8.220 | 65.50 | [config](flamingo_zeroshot_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth) |
*Models with * are converted from the [OpenFlamingo repo](https://github.com/mlfoundations/open_flamingo). The config files of these models are only for inference. We haven't reproduced the training results.*
### Visual Question Answering on VQAv2
| Model | Params (G) | Accuracy | Config | Download |
| :--------------------------------- | :--------: | :------: | :--------------------------------: | :----------------------------------------------------------------------------------------------------------------: |
| `flamingo_3rdparty-zeroshot_vqa`\* | 8.22 | 43.50 | [config](flamingo_zeroshot_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth) |
*Models with * are converted from the [OpenFlamingo repo](https://github.com/mlfoundations/open_flamingo). The config files of these models are only for inference. We haven't reproduced the training results.*
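Zero- and few-shot behaviour is controlled by the prompt templates in the configs below (`shot_prompt_tmpl`, `final_prompt_tmpl` and `zeroshot_prompt`). As a hedged illustration, not code from the repo, a 2-shot caption prompt is assembled roughly like this:
```python
# Hypothetical support examples; in the configs the shots are sampled from a COCO
# support set (num_support_examples) at evaluation time.
shots = [
    {'caption': 'A dog runs across the grass.'},
    {'caption': 'Two cats sit on a sofa.'},
]

shot_prompt_tmpl = '<image>Output:{caption}<|endofchunk|>'
final_prompt_tmpl = '<image>Output:'

prompt = ''.join(shot_prompt_tmpl.format(**s) for s in shots) + final_prompt_tmpl
print(prompt)
# <image>Output:A dog runs across the grass.<|endofchunk|><image>Output:Two cats sit on a sofa.<|endofchunk|><image>Output:
```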
## Citation
```bibtex
@article{Alayrac2022FlamingoAV,
title={Flamingo: a Visual Language Model for Few-Shot Learning},
author={Jean-Baptiste Alayrac and Jeff Donahue and Pauline Luc and Antoine Miech and Iain Barr and Yana Hasson and Karel Lenc and Arthur Mensch and Katie Millican and Malcolm Reynolds and Roman Ring and Eliza Rutherford and Serkan Cabi and Tengda Han and Zhitao Gong and Sina Samangooei and Marianne Monteiro and Jacob Menick and Sebastian Borgeaud and Andy Brock and Aida Nematzadeh and Sahand Sharifzadeh and Mikolaj Binkowski and Ricardo Barreira and Oriol Vinyals and Andrew Zisserman and Karen Simonyan},
journal={ArXiv},
year={2022},
volume={abs/2204.14198}
}
```
```bibtex
@software{anas_awadalla_2023_7733589,
author = {Awadalla, Anas and Gao, Irena and Gardner, Joshua and Hessel, Jack and Hanafy, Yusuf and Zhu, Wanrong and Marathe, Kalyani and Bitton, Yonatan and Gadre, Samir and Jitsev, Jenia and Kornblith, Simon and Koh, Pang Wei and Ilharco, Gabriel and Wortsman, Mitchell and Schmidt, Ludwig},
title = {OpenFlamingo},
month = mar,
year = 2023,
publisher = {Zenodo},
version = {v0.1.1},
doi = {10.5281/zenodo.7733589},
url = {https://doi.org/10.5281/zenodo.7733589}
}
```
_base_ = [
'../_base_/default_runtime.py',
]
# model settings
model = dict(
type='Flamingo',
tokenizer=dict(
type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
vision_encoder=dict(
type='VisionTransformer',
arch='l',
patch_size=14,
pre_norm=True,
norm_cfg=dict(type='LN', eps=1e-5),
layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
final_norm=False,
out_type='raw',
pretrained=(
'https://download.openmmlab.com/mmclassification/v0/clip/'
'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
),
lang_encoder=dict(
base=dict(
type='AutoModelForCausalLM',
name_or_path='decapoda-research/llama-7b-hf',
local_files_only=True),
adapter=dict(
type='FlamingoLMAdapter',
vis_hidden_size=1024,
cross_attn_every_n_layers=4,
use_media_placement_augmentation=False),
),
task='caption',
shot_prompt_tmpl='<image>Output:{caption}<|endofchunk|>',
final_prompt_tmpl='<image>Output:',
generation_cfg=dict(num_beams=3, max_new_tokens=20, length_penalty=-2.0))
# data settings
data_preprocessor = dict(
mean=[122.770938, 116.7460125, 104.09373615],
std=[68.5005327, 66.6321579, 70.32316305],
to_rgb=True,
)
test_pipeline = [
dict(
type='ApplyToList',
# Flamingo requires loading multiple images during few-shot inference.
scatter_key='img_path',
transforms=[
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=224,
interpolation='bicubic',
backend='pillow'),
dict(type='CenterCrop', crop_size=(224, 224)),
],
collate_keys=['img', 'scale_factor', 'ori_shape'],
),
dict(
type='PackInputs',
algorithm_keys=['gt_caption', 'shots'],
meta_keys=['image_id']),
]
val_dataloader = dict(
batch_size=8,
num_workers=8,
dataset=dict(
type='FlamingoEvalCOCOCaption',
data_root='data/coco',
ann_file='annotations/captions_train2014.json',
data_prefix=dict(img_path='train2014'),
pipeline=test_pipeline,
num_shots=2,
num_support_examples=2048,
num_query_examples=5000,
),
sampler=dict(type='DefaultSampler', shuffle=False),
persistent_workers=True,
)
val_evaluator = dict(
type='COCOCaption',
ann_file='data/coco/annotations/captions_train2014.json')
# For a standard test, configure the test dataset manually.
test_dataloader = val_dataloader
test_evaluator = val_evaluator
# schedule settings
val_cfg = dict()
test_cfg = dict()
_base_ = [
'../_base_/default_runtime.py',
]
# model settings
model = dict(
type='Flamingo',
tokenizer=dict(
type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
vision_encoder=dict(
type='VisionTransformer',
arch='l',
patch_size=14,
pre_norm=True,
norm_cfg=dict(type='LN', eps=1e-5),
layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
final_norm=False,
out_type='raw',
pretrained=(
'https://download.openmmlab.com/mmclassification/v0/clip/'
'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
),
lang_encoder=dict(
base=dict(
type='AutoModelForCausalLM',
name_or_path='decapoda-research/llama-7b-hf',
local_files_only=True),
adapter=dict(
type='FlamingoLMAdapter',
vis_hidden_size=1024,
cross_attn_every_n_layers=4,
use_media_placement_augmentation=False),
),
task='vqa',
shot_prompt_tmpl=
'<image>Question:{question} Short Answer:{answer}<|endofchunk|>',
final_prompt_tmpl='<image>Question:{question} Short Answer:',
generation_cfg=dict(num_beams=3, max_new_tokens=5, length_penalty=-2.0))
# data settings
data_preprocessor = dict(
mean=[122.770938, 116.7460125, 104.09373615],
std=[68.5005327, 66.6321579, 70.32316305],
to_rgb=True,
)
test_pipeline = [
dict(
type='ApplyToList',
# Flamingo requires loading multiple images during few-shot inference.
scatter_key='img_path',
transforms=[
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=224,
interpolation='bicubic',
backend='pillow'),
dict(type='CenterCrop', crop_size=(224, 224)),
],
collate_keys=['img', 'scale_factor', 'ori_shape'],
),
dict(
type='PackInputs',
algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'],
meta_keys=['image_id']),
]
val_dataloader = dict(
batch_size=8,
num_workers=8,
dataset=dict(
type='FlamingoEvalCOCOVQA',
data_root='data/coco',
data_prefix='val2014',
question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json',
ann_file='annotations/v2_mscoco_val2014_annotations.json',
pipeline=test_pipeline,
num_shots=2,
num_support_examples=2048,
num_query_examples=5000,
),
sampler=dict(type='DefaultSampler', shuffle=False),
persistent_workers=True,
)
val_evaluator = dict(type='VQAAcc')
test_dataloader = dict(
batch_size=8,
num_workers=8,
dataset=dict(
type='FlamingoEvalCOCOVQA',
data_root='data/coco',
data_prefix='test2015',
question_file=
'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json',
pipeline=test_pipeline,
num_shots=0,
num_support_examples=2048,
num_query_examples=5000,
),
sampler=dict(type='DefaultSampler', shuffle=False),
persistent_workers=True,
)
test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json')
# schedule settings
val_cfg = dict()
test_cfg = dict()
_base_ = [
'../_base_/default_runtime.py',
]
zeroshot_prompt = (
'Output:A child holding a flowered umbrella and petting a yak.<|endofchunk|>' # noqa: E501
'Output:The child is holding a brush close to his mouth.<|endofchunk|>' # noqa: E501
)
# model settings
model = dict(
type='Flamingo',
tokenizer=dict(
type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
vision_encoder=dict(
type='VisionTransformer',
arch='l',
patch_size=14,
pre_norm=True,
norm_cfg=dict(type='LN', eps=1e-5),
layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
final_norm=False,
out_type='raw',
pretrained=(
'https://download.openmmlab.com/mmclassification/v0/clip/'
'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
),
lang_encoder=dict(
base=dict(
type='AutoModelForCausalLM',
name_or_path='decapoda-research/llama-7b-hf',
local_files_only=True),
adapter=dict(
type='FlamingoLMAdapter',
vis_hidden_size=1024,
cross_attn_every_n_layers=4,
use_media_placement_augmentation=False),
),
task='caption',
zeroshot_prompt=zeroshot_prompt,
final_prompt_tmpl='<image>Output:',
generation_cfg=dict(num_beams=3, max_new_tokens=20, length_penalty=-2.0),
)
# data settings
data_preprocessor = dict(
type='MultiModalDataPreprocessor',
mean=[122.770938, 116.7460125, 104.09373615],
std=[68.5005327, 66.6321579, 70.32316305],
to_rgb=True,
)
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=224,
interpolation='bicubic',
backend='pillow'),
dict(type='CenterCrop', crop_size=(224, 224)),
dict(
type='PackInputs',
algorithm_keys=['gt_caption'],
meta_keys=['image_id'],
),
]
val_dataloader = dict(
batch_size=8,
num_workers=8,
dataset=dict(
type='FlamingoEvalCOCOCaption',
data_root='data/coco',
ann_file='annotations/captions_train2014.json',
data_prefix=dict(img_path='train2014'),
pipeline=test_pipeline,
num_shots=0,
num_support_examples=2048,
num_query_examples=5000,
),
sampler=dict(type='DefaultSampler', shuffle=False),
persistent_workers=True,
)
val_evaluator = dict(
type='COCOCaption',
ann_file='data/coco/annotations/captions_train2014.json')
# For a standard test, configure the test dataset manually.
test_dataloader = val_dataloader
test_evaluator = val_evaluator
# schedule settings
val_cfg = dict()
test_cfg = dict()
_base_ = [
'../_base_/default_runtime.py',
]
zeroshot_prompt = (
'Question:What is this photo taken looking through? Short Answer:pitcher<|endofchunk|>' # noqa: E501
'Question:How many people are wearing shorts in the forefront of this photo? Short Answer:4<|endofchunk|>' # noqa: E501
)
# model settings
model = dict(
type='Flamingo',
tokenizer=dict(
type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
vision_encoder=dict(
type='VisionTransformer',
arch='l',
patch_size=14,
pre_norm=True,
norm_cfg=dict(type='LN', eps=1e-5),
layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
final_norm=False,
out_type='raw',
pretrained=(
'https://download.openmmlab.com/mmclassification/v0/clip/'
'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
),
lang_encoder=dict(
base=dict(
type='AutoModelForCausalLM',
name_or_path='decapoda-research/llama-7b-hf',
local_files_only=True),
adapter=dict(
type='FlamingoLMAdapter',
vis_hidden_size=1024,
cross_attn_every_n_layers=4,
use_media_placement_augmentation=False),
),
task='vqa',
zeroshot_prompt=zeroshot_prompt,
final_prompt_tmpl='<image>Question:{question} Short Answer:',
generation_cfg=dict(num_beams=3, max_new_tokens=5, length_penalty=-2.0))
# data settings
data_preprocessor = dict(
type='MultiModalDataPreprocessor',
mean=[122.770938, 116.7460125, 104.09373615],
std=[68.5005327, 66.6321579, 70.32316305],
to_rgb=True,
)
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=224,
interpolation='bicubic',
backend='pillow'),
dict(type='CenterCrop', crop_size=(224, 224)),
dict(
type='PackInputs',
algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'],
meta_keys=['image_id'],
),
]
val_dataloader = dict(
batch_size=8,
num_workers=8,
dataset=dict(
type='FlamingoEvalCOCOVQA',
data_root='data/coco',
data_prefix='val2014',
question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json',
ann_file='annotations/v2_mscoco_val2014_annotations.json',
pipeline=test_pipeline,
num_shots=0,
num_support_examples=2048,
num_query_examples=5000,
),
sampler=dict(type='DefaultSampler', shuffle=False),
persistent_workers=True,
)
val_evaluator = dict(type='VQAAcc')
test_dataloader = dict(
batch_size=8,
num_workers=8,
dataset=dict(
type='FlamingoEvalCOCOVQA',
data_root='data/coco',
data_prefix='test2015',
question_file=
'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json',
pipeline=test_pipeline,
num_shots=0,
num_support_examples=2048,
num_query_examples=5000,
),
sampler=dict(type='DefaultSampler', shuffle=False),
persistent_workers=True,
)
test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json')
# schedule settings
val_cfg = dict()
test_cfg = dict()
Collections:
- Name: Flamingo
Metadata:
Architecture:
- Transformer
- Gated Cross-Attention Dense
Paper:
Title: 'Flamingo: a Visual Language Model for Few-Shot Learning'
URL: https://arxiv.org/abs/2204.14198
README: configs/flamingo/README.md
Models:
- Name: flamingo_3rdparty-zeroshot_caption
Metadata:
FLOPs: null
Parameters: 8220452880
In Collection: Flamingo
Results:
- Task: Image Caption
Dataset: COCO
Metrics:
CIDER: 65.50 # Report from the official repo
Weights: https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth
Config: configs/flamingo/flamingo_zeroshot_caption.py
Converted From:
Weights: https://huggingface.co/openflamingo/OpenFlamingo-9B
Code: https://github.com/mlfoundations/open_flamingo
- Name: flamingo_3rdparty-zeroshot_vqa
Metadata:
FLOPs: null
Parameters: 8220452880
In Collection: Flamingo
Results:
- Task: Visual Question Answering
Dataset: VQAv2
Metrics:
Accuracy: 43.50 # Report from the official repo
Weights: https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth
Config: configs/flamingo/flamingo_zeroshot_vqa.py
Converted From:
Weights: https://huggingface.co/openflamingo/OpenFlamingo-9B
Code: https://github.com/mlfoundations/open_flamingo
# GLIP
> [Grounded Language-Image Pre-training](https://arxiv.org/abs/2112.03857)
<!-- [ALGORITHM] -->
## Abstract
This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head.
<div align="center">
<img src="https://github.com/microsoft/GLIP/blob/main/docs/lead.png" width="70%"/>
</div>
## How to use it?
<!-- [TABS-BEGIN] -->
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('swin-t_glip-pre_3rdparty', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
<!-- [TABS-END] -->
## Results and models
### Pre-trained models
The pre-trained models are intended for fine-tuning and therefore have no evaluation results.
| Model | Pretrain | Resolution | Download |
| :------------------------------------------ | :------------------------: | :--------: | :-------------------------------------------------------------------------------------------------------------------: |
| GLIP-T (`swin-t_glip-pre_3rdparty`)\* | O365,GoldG,CC3M,SBU | 224x224 | [model](https://download.openmmlab.com/mmclassification/v1/glip/swin-t_glip-pre_3rdparty_20230413-d85813b5.pth) |
| GLIP-L (`swin-l_glip-pre_3rdparty_384px`)\* | FourODs,GoldG,CC3M+12M,SBU | 384x384 | [model](https://download.openmmlab.com/mmclassification/v1/glip/swin-l_glip-pre_3rdparty_384px_20230413-04b198e8.pth) |
*Models with * are converted from the [official repo](https://github.com/microsoft/GLIP).*
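Because these weights come from a detection backbone, the classification configs expose the last three Swin stages (`out_indices=(1, 2, 3)`), so `extract_feat` returns a tuple of multi-scale feature maps. A minimal sketch; the printed shapes are an assumption based on the standard Swin-T stage widths (192/384/768 channels at strides 8/16/32):
```python
import torch
from mmpretrain import get_model

# GLIP-T backbone from the table above, used as a multi-scale feature extractor.
model = get_model('swin-t_glip-pre_3rdparty', pretrained=True)
feats = model.extract_feat(torch.rand(1, 3, 224, 224))
for feat in feats:
    print(feat.shape)
# Expected roughly: (1, 192, 28, 28), (1, 384, 14, 14), (1, 768, 7, 7)
```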
## Citation
```bibtex
@inproceedings{li2021grounded,
title={Grounded Language-Image Pre-training},
author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
year={2022},
booktitle={CVPR},
}
```
model = dict(
type='ImageClassifier',
backbone=dict(
type='SwinTransformer',
arch='large',
img_size=384,
out_indices=(1, 2, 3),  # the original weights are for detection
stage_cfgs=dict(block_cfgs=dict(window_size=12))),
neck=None,
head=None)
data_preprocessor = dict(
# BGR format normalization parameters (Caffe style)
mean=[103.53, 116.28, 123.675],
std=[57.375, 57.12, 58.395],
# keep images in BGR order
to_rgb=False,
)
model = dict(
type='ImageClassifier',
backbone=dict(
type='SwinTransformer',
arch='tiny',
img_size=224,
out_indices=(1, 2, 3),  # the original weights are for detection
),
neck=None,
head=None)
data_preprocessor = dict(
# BGR format normalization parameters (Caffe style)
mean=[103.53, 116.28, 123.675],
std=[57.375, 57.12, 58.395],
# keep images in BGR order
to_rgb=False,
)