Collections:
- Name: EVA
Metadata:
Architecture:
- Attention Dropout
- Convolution
- Dense Connections
- Dropout
- GELU
- Layer Normalization
- Multi-Head Attention
- Scaled Dot-Product Attention
- Tanh Activation
Paper:
Title: 'EVA: Exploring the Limits of Masked Visual Representation Learning at
Scale'
URL: https://arxiv.org/abs/2211.07636
README: configs/eva/README.md
Code:
URL: null
Version: null
Models:
- Name: eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k
Metadata:
Epochs: 400
Batch Size: 4096
FLOPs: 17581972224
Parameters: 111776512
Training Data: ImageNet-1k
In Collection: EVA
Results: null
Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth
Config: configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py
Downstream:
- vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k
- vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k
- Name: vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 1024
FLOPs: 17581215744
Parameters: 86566120
Training Data: ImageNet-1k
In Collection: EVA
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 83.7
Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.pth
Config: configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
- Name: vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k
Metadata:
Epochs: 100
Batch Size: 16384
FLOPs: 17581972992
Parameters: 86567656
Training Data: ImageNet-1k
In Collection: EVA
Results:
- Task: Image Classification
Dataset: ImageNet-1k
Metrics:
Top 1 Accuracy: 69.0
Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221226-ef51bf09.pth
Config: configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
- Name: beit-l-p14_eva-pre_3rdparty_in1k-196px
Metadata:
FLOPs: 61565981696
Parameters: 304142312
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 87.94
Top 5 Accuracy: 98.5
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-196px_20221214-2adf4d28.pth
Config: configs/eva/eva-l-p14_8xb16_in1k-196px.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_196px_1k_ft_88p0.pt
Code: https://github.com/baaivision/EVA
- Name: beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px
Metadata:
FLOPs: 61565981696
Parameters: 304142312
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 88.58
Top 5 Accuracy: 98.65
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-196px_20221213-b730c7e7.pth
Config: configs/eva/eva-l-p14_8xb16_in1k-196px.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_196px_21k_to_1k_ft_88p6.pt
Code: https://github.com/baaivision/EVA
- Name: beit-l-p14_3rdparty-eva_in21k
Metadata:
FLOPs: 81075147776
Parameters: 303178752
Training Data:
- ImageNet-21k
In Collection: EVA
Results: null
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth
Config: configs/eva/eva-l-p14_headless.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14.pt
Code: https://github.com/baaivision/EVA
Downstream:
- beit-l-p14_eva-pre_3rdparty_in21k
- beit-l-p14_eva-pre_3rdparty_in1k-336px
- beit-l-p14_eva-pre_3rdparty_in1k-196px
- Name: beit-l-p14_eva-pre_3rdparty_in21k
Metadata:
FLOPs: 81075147776
Parameters: 303178752
Training Data:
- ImageNet-21k
In Collection: EVA
Results: null
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in21k_20221213-8f194fa2.pth
Config: configs/eva/eva-l-p14_headless.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_21k_ft.pt
Code: https://github.com/baaivision/EVA
- Name: beit-l-p14_eva-pre_3rdparty_in1k-336px
Metadata:
FLOPs: 191100916736
Parameters: 304531432
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 88.66
Top 5 Accuracy: 98.75
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-336px_20221214-07785cfd.pth
Config: configs/eva/eva-l-p14_8xb16_in1k-336px.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_336px_1k_ft_88p65.pt
Code: https://github.com/baaivision/EVA
Downstream:
- beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px
- beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px
- Name: beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px
Metadata:
FLOPs: 191100916736
Parameters: 304531432
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 89.17
Top 5 Accuracy: 98.86
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-336px_20221213-f25b7634.pth
Config: configs/eva/eva-l-p14_8xb16_in1k-336px.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_336px_21k_to_1k_ft_89p2.pt
Code: https://github.com/baaivision/EVA
- Name: beit-g-p16_3rdparty-eva_30m
Metadata:
FLOPs: 203517463424
Parameters: 1011315072
Training Data:
- merged-30M
In Collection: EVA
Results: null
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p16_3rdparty_30m_20221213-7bed23ee.pth
Config: configs/eva/eva-g-p16_headless.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_psz14to16.pt
Code: https://github.com/baaivision/EVA
- Name: beit-g-p14_3rdparty-eva_30m
Metadata:
FLOPs: 267174833024
Parameters: 1011596672
Training Data:
- merged-30M
In Collection: EVA
Results: null
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_3rdparty_30m_20221213-3b7aca97.pth
Config: configs/eva/eva-g-p14_headless.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_psz14.pt
Code: https://github.com/baaivision/EVA
Downstream:
- beit-g-p14_eva-30m-pre_3rdparty_in21k
- Name: beit-g-p14_eva-30m-pre_3rdparty_in21k
Metadata:
FLOPs: 267174833024
Parameters: 1011596672
Training Data:
- merged-30M
- ImageNet-21k
In Collection: EVA
Results: null
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth
Config: configs/eva/eva-g-p14_headless.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_224px_psz14.pt
Code: https://github.com/baaivision/EVA
Downstream:
- beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px
- beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px
- Name: beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px
Metadata:
FLOPs: 620642757504
Parameters: 1013005672
Training Data:
- merged-30M
- ImageNet-21k
- ImageNet-1k
In Collection: EVA
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 89.61
Top 5 Accuracy: 98.93
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-336px_20221213-210f9071.pth
Config: configs/eva/eva-g-p14_8xb16_in1k-336px.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_336px_psz14_ema_89p6.pt
Code: https://github.com/baaivision/EVA
- Name: beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px
Metadata:
FLOPs: 1906761591680
Parameters: 1014447464
Training Data:
- merged-30M
- ImageNet-21k
- ImageNet-1k
In Collection: EVA
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 89.71
Top 5 Accuracy: 98.96
Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-560px_20221213-fa1c3652.pth
Config: configs/eva/eva-g-p14_8xb16_in1k-560px.py
Converted From:
Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_560px_psz14_ema_89p7.pt
Code: https://github.com/baaivision/EVA
# EVA-02
> [EVA-02: A Visual Representation for Neon Genesis](https://arxiv.org/abs/2303.11331)
<!-- [ALGORITHM] -->
## Abstract
We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 parameters and ~1/6 image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community.
<div align=center>
<img src="https://user-images.githubusercontent.com/40905160/229037980-b83dceb5-41d6-406c-a20b-63b83c80136d.png" width="70%" alt="TrV builds upon the original plain ViT architecture and includes several enhancements: SwinGLU FFN, sub-LN, 2D RoPE, and JAX weight initialization. To keep the parameter & FLOPs consistent with the baseline, the FFN hidden dim of SwiGLU is 2/3× of the typical MLP counterpart."/>
</div>
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
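For batch prediction over many images, the task-specific inferencer can be constructed once and reused. This is a minimal sketch: the model name is the same one used above, and the second image path is a placeholder.
```python
from mmpretrain import ImageClassificationInferencer

# Build the inferencer once, then feed it a list of image paths.
inferencer = ImageClassificationInferencer(
    'vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px')
results = inferencer(['demo/bird.JPEG', 'path/to/another.JPEG'], batch_size=2)
print(results[0]['pred_class'], results[0]['pred_score'])
```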
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', pretrained=True)
inputs = torch.rand(1, 3, 336, 336)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/eva02/eva02-tiny-p14_in1k.py
```
Test:
```shell
python tools/test.py configs/eva02/eva02-tiny-p14_in1k.py /path/to/eva02-tiny-p14_in1k.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :-------------------------------- | :--------: | :-------: | :-----------------------------------: | :-----------------------------------------------------------------------------------------------------------: |
| `vit-tiny-p14_eva02-pre_in21k`\* | 5.50 | 1.70 | [config](eva02-tiny-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_pre_in21k_20230505-d703e7b1.pth) |
| `vit-small-p14_eva02-pre_in21k`\* | 21.62 | 6.14 | [config](eva02-small-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_pre_in21k_20230505-3175f463.pth) |
| `vit-base-p14_eva02-pre_in21k`\* | 85.77 | 23.22 | [config](eva02-base-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_pre_in21k_20230505-2f2d4d3c.pth) |
| `vit-large-p14_eva02-pre_in21k`\* | 303.29 | 81.15 | [config](eva02-large-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_in21k_20230505-9072de5d.pth) |
| `vit-large-p14_eva02-pre_m38m`\* | 303.29 | 81.15 | [config](eva02-large-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_m38m_20230505-b8a1a261.pth) |
- The input size / patch size of MIM pre-trained EVA-02 is `224x224` / `14x14`.
*Models with * are converted from the [official repo](https://github.com/baaivision/EVA).*
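The pre-trained checkpoints above are backbone-only (their configs set `neck=None` and `head=None`), so they are intended for feature extraction or downstream fine-tuning rather than direct classification. A minimal sketch, assuming the weights download successfully:
```python
import torch
from mmpretrain import get_model

# MIM pre-trained EVA-02 base backbone; expects 224x224 inputs with a 14x14 patch size.
model = get_model('vit-base-p14_eva02-pre_in21k', pretrained=True)
feats = model.extract_feat(torch.rand(1, 3, 224, 224))
print(type(feats), [f.shape for f in feats])
```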
### Image Classification on ImageNet-1k
#### (*w/o* IN-21K intermediate fine-tuning)
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------: |
| `vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k | 5.76 | 4.68 | 80.69 | 95.54 | [config](./eva02-tiny-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_in21k-pre_3rdparty_in1k-336px_20230505-a4e8708a.pth) |
| `vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k | 22.13 | 15.48 | 85.78 | 97.60 | [config](./eva02-small-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_in21k-pre_3rdparty_in1k-336px_20230505-9c5b0e85.pth) |
| `vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.29 | 98.53 | [config](./eva02-base-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_3rdparty_in1k-448px_20230505-8ad211c5.pth) |
*Models with * are converted from the [official repo](https://github.com/baaivision/EVA/tree/master/EVA-02). The config files of these models are only for inference. We haven't reproduced the training results.*
#### (*w* IN-21K intermediate fine-tuning)
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------: |
| `vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.47 | 98.62 | [config](./eva02-base-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-5cd4d87f.pth) |
| `vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 305.08 | 362.33 | 89.65 | 98.95 | [config](./eva02-large-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-926d1599.pth) |
| `vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 Merged-38M | 305.10 | 362.33 | 89.83 | 99.00 | [config](./eva02-large-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_m38m-pre_in21k-medft_3rdparty_in1k-448px_20230505-150dc5ed.pth) |
*Models with * are converted from the [official repo](https://github.com/baaivision/EVA/tree/master/EVA-02). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@article{EVA-02,
title={EVA-02: A Visual Representation for Neon Genesis},
author={Yuxin Fang and Quan Sun and Xinggang Wang and Tiejun Huang and Xinlong Wang and Yue Cao},
journal={arXiv preprint arXiv:2303.11331},
year={2023}
}
```
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='b',
img_size=224,
patch_size=14,
sub_ln=True,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=None,
)
data_preprocessor = dict(
# RGB format normalization parameters
mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
# convert image from BGR to RGB
to_rgb=True,
)
_base_ = [
'../_base_/datasets/imagenet_bs16_eva_448.py',
'../_base_/schedules/imagenet_bs2048_AdamW.py',
'../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='b',
img_size=448,
patch_size=14,
sub_ln=True,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=.02),
dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
],
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='l',
img_size=224,
patch_size=14,
sub_ln=True,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=None,
)
data_preprocessor = dict(
# RGB format normalization parameters
mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
# convert image from BGR to RGB
to_rgb=True,
)
_base_ = [
'../_base_/datasets/imagenet_bs16_eva_448.py',
'../_base_/schedules/imagenet_bs2048_AdamW.py',
'../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='l',
img_size=448,
patch_size=14,
sub_ln=True,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=1024,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=.02),
dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
],
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='s',
img_size=224,
patch_size=14,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=None,
)
data_preprocessor = dict(
# RGB format normalization parameters
mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
# convert image from BGR to RGB
to_rgb=True,
)
_base_ = [
'../_base_/datasets/imagenet_bs16_eva_336.py',
'../_base_/schedules/imagenet_bs2048_AdamW.py',
'../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='s',
img_size=336,
patch_size=14,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=384,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=.02),
dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
],
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='t',
img_size=224,
patch_size=14,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=None,
)
data_preprocessor = dict(
# RGB format normalization parameters
mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
# convert image from BGR to RGB
to_rgb=True,
)
_base_ = [
'../_base_/datasets/imagenet_bs16_eva_336.py',
'../_base_/schedules/imagenet_bs2048_AdamW.py',
'../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
type='ViTEVA02',
arch='t',
img_size=336,
patch_size=14,
final_norm=False,
out_type='avg_featmap'),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=192,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=.02),
dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
],
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
Collections:
- Name: EVA02
Metadata:
Architecture:
- Rotary Position Embedding
- Sub Layer Normalization
- SwiGLU
Paper:
Title: 'EVA-02: A Visual Representation for Neon Genesis'
URL: https://arxiv.org/abs/2303.11331
README: configs/eva02/README.md
Models:
- Name: vit-tiny-p14_eva02-pre_in21k
Metadata:
FLOPs: 1703439360
Parameters: 5504064
Training Data:
- ImageNet-21k
In Collection: EVA02
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_pre_in21k_20230505-d703e7b1.pth
Config: configs/eva02/eva02-tiny-p14_headless.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_Ti_pt_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
Downstream:
- vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px
- Name: vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px
Metadata:
FLOPs: 4675416000
Parameters: 5758888
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA02
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 80.69
Top 5 Accuracy: 95.54
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_in21k-pre_3rdparty_in1k-336px_20230505-a4e8708a.pth
Config: configs/eva02/eva02-tiny-p14_in1k.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_Ti_pt_in21k_ft_in1k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
- Name: vit-small-p14_eva02-pre_in21k
Metadata:
FLOPs: 6135404544
Parameters: 21624960
Training Data:
- ImageNet-21k
In Collection: EVA02
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_pre_in21k_20230505-3175f463.pth
Config: configs/eva02/eva02-small-p14_headless.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_S_pt_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
Downstream:
- vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px
- Name: vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px
Metadata:
FLOPs: 15476744064
Parameters: 22133608
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA02
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 85.78
Top 5 Accuracy: 97.60
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_in21k-pre_3rdparty_in1k-336px_20230505-9c5b0e85.pth
Config: configs/eva02/eva02-small-p14_in1k.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_S_pt_in21k_ft_in1k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
- Name: vit-base-p14_eva02-pre_in21k
Metadata:
FLOPs: 23216492544
Parameters: 85766400
Training Data:
- ImageNet-21k
In Collection: EVA02
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_pre_in21k_20230505-2f2d4d3c.pth
Config: configs/eva02/eva02-base-p14_headless.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_B_pt_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
Downstream:
- vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px
- vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
- Name: vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px
Metadata:
FLOPs: 107105984256
Parameters: 87126760
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA02
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 88.29
Top 5 Accuracy: 98.53
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_3rdparty_in1k-448px_20230505-8ad211c5.pth
Config: configs/eva02/eva02-base-p14_in1k.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_B_pt_in21k_ft_in1k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
- Name: vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
Metadata:
FLOPs: 107105984256
Parameters: 87126760
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA02
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 88.47
Top 5 Accuracy: 98.62
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-5cd4d87f.pth
Config: configs/eva02/eva02-base-p14_in1k.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_B_pt_in21k_medft_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
- Name: vit-large-p14_eva02-pre_in21k
Metadata:
FLOPs: 81146703792
Parameters: 303291328
Training Data:
- ImageNet-21k
In Collection: EVA02
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_in21k_20230505-9072de5d.pth
Config: configs/eva02/eva02-large-p14_headless.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_L_pt_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
Downstream:
- vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
- Name: vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
Metadata:
FLOPs: 362333836208
Parameters: 305104808
Training Data:
- ImageNet-21k
- ImageNet-1k
In Collection: EVA02
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 89.65
Top 5 Accuracy: 98.95
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-926d1599.pth
Config: configs/eva02/eva02-large-p14_in1k.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_L_pt_in21k_medft_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
- Name: vit-large-p14_eva02-pre_m38m
Metadata:
FLOPs: 81146703792
Parameters: 303291328
Training Data:
- Merged-38M
In Collection: EVA02
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_m38m_20230505-b8a1a261.pth
Config: configs/eva02/eva02-large-p14_headless.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_L_pt_m38m_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
Downstream:
- vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px
- Name: vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px
Metadata:
FLOPs: 362333836208
Parameters: 305104808
Training Data:
- Merged-38M
- ImageNet-21k
- ImageNet-1k
In Collection: EVA02
Results:
- Dataset: ImageNet-1k
Task: Image Classification
Metrics:
Top 1 Accuracy: 89.83
Top 5 Accuracy: 99.00
Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_m38m-pre_in21k-medft_3rdparty_in1k-448px_20230505-150dc5ed.pth
Config: configs/eva02/eva02-large-p14_in1k.py
Converted From:
Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_L_pt_m38m_medft_in21k_p14.pt
Code: https://github.com/baaivision/EVA/tree/master/EVA-02
# Flamingo
> [Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198)
<!-- [ALGORITHM] -->
## Abstract
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/236371424-3b9d2e16-3966-4c64-8b87-e33fd6348824.png" width="80%"/>
</div>
## How to use it?
<!-- [TABS-BEGIN] -->
**Use the model**
```python
from mmpretrain import inference_model
result = inference_model('flamingo_3rdparty-zeroshot_caption', 'demo/cat-dog.png')
print(result)
# {'pred_caption': 'A dog and a cat are looking at each other. '}
```
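The same weights also back the zero-shot VQA entry below. A minimal sketch, assuming `inference_model` forwards the question as the second positional argument to the VQA inferencer:
```python
from mmpretrain import inference_model

# Model name taken from the metafile; the question string is a free-form prompt.
result = inference_model('flamingo_3rdparty-zeroshot_vqa',
                         'demo/cat-dog.png',
                         'What animals are in the picture?')
print(result)
```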
**Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Test:
```shell
python tools/test.py configs/flamingo/flamingo_zeroshot_caption.py https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Caption on COCO
| Model | Params (G) | CIDEr | Config | Download |
| :------------------------------------- | :--------: | :---: | :------------------------------------: | :-----------------------------------------------------------------------------------------------------------: |
| `flamingo_3rdparty-zeroshot_caption`\* | 8.220 | 65.50 | [config](flamingo_zeroshot_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth) |
*Models with * are converted from the [OpenFlamingo repo](https://github.com/mlfoundations/open_flamingo). The config files of these models are only for inference. We haven't reproduced the training results.*
### Visual Question Answering on VQAv2
| Model | Params (G) | Accuracy | Config | Download |
| :--------------------------------- | :--------: | :------: | :--------------------------------: | :----------------------------------------------------------------------------------------------------------------: |
| `flamingo_3rdparty-zeroshot_vqa`\* | 8.22 | 43.50 | [config](flamingo_zeroshot_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth) |
*Models with * are converted from the [OpenFlamingo repo](https://github.com/mlfoundations/open_flamingo). The config files of these models are only for inference. We haven't reproduced the training results.*
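Zero- and few-shot behaviour is controlled by the prompt templates in the configs below (`shot_prompt_tmpl`, `final_prompt_tmpl` and `zeroshot_prompt`). As a hedged illustration, not code from the repo, a 2-shot caption prompt is assembled roughly like this:
```python
# Hypothetical support examples; in the configs the shots are sampled from a COCO
# support set (num_support_examples) at evaluation time.
shots = [
    {'caption': 'A dog runs across the grass.'},
    {'caption': 'Two cats sit on a sofa.'},
]

shot_prompt_tmpl = '<image>Output:{caption}<|endofchunk|>'
final_prompt_tmpl = '<image>Output:'

prompt = ''.join(shot_prompt_tmpl.format(**s) for s in shots) + final_prompt_tmpl
print(prompt)
# <image>Output:A dog runs across the grass.<|endofchunk|><image>Output:Two cats sit on a sofa.<|endofchunk|><image>Output:
```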
## Citation
```bibtex
@article{Alayrac2022FlamingoAV,
title={Flamingo: a Visual Language Model for Few-Shot Learning},
author={Jean-Baptiste Alayrac and Jeff Donahue and Pauline Luc and Antoine Miech and Iain Barr and Yana Hasson and Karel Lenc and Arthur Mensch and Katie Millican and Malcolm Reynolds and Roman Ring and Eliza Rutherford and Serkan Cabi and Tengda Han and Zhitao Gong and Sina Samangooei and Marianne Monteiro and Jacob Menick and Sebastian Borgeaud and Andy Brock and Aida Nematzadeh and Sahand Sharifzadeh and Mikolaj Binkowski and Ricardo Barreira and Oriol Vinyals and Andrew Zisserman and Karen Simonyan},
journal={ArXiv},
year={2022},
volume={abs/2204.14198}
}
```
```bibtex
@software{anas_awadalla_2023_7733589,
author = {Awadalla, Anas and Gao, Irena and Gardner, Joshua and Hessel, Jack and Hanafy, Yusuf and Zhu, Wanrong and Marathe, Kalyani and Bitton, Yonatan and Gadre, Samir and Jitsev, Jenia and Kornblith, Simon and Koh, Pang Wei and Ilharco, Gabriel and Wortsman, Mitchell and Schmidt, Ludwig},
title = {OpenFlamingo},
month = mar,
year = 2023,
publisher = {Zenodo},
version = {v0.1.1},
doi = {10.5281/zenodo.7733589},
url = {https://doi.org/10.5281/zenodo.7733589}
}
```
_base_ = [
'../_base_/default_runtime.py',
]
# model settings
model = dict(
type='Flamingo',
tokenizer=dict(
type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
vision_encoder=dict(
type='VisionTransformer',
arch='l',
patch_size=14,
pre_norm=True,
norm_cfg=dict(type='LN', eps=1e-5),
layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
final_norm=False,
out_type='raw',
pretrained=(
'https://download.openmmlab.com/mmclassification/v0/clip/'
'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
),
lang_encoder=dict(
base=dict(
type='AutoModelForCausalLM',
name_or_path='decapoda-research/llama-7b-hf',
local_files_only=True),
adapter=dict(
type='FlamingoLMAdapter',
vis_hidden_size=1024,
cross_attn_every_n_layers=4,
use_media_placement_augmentation=False),
),
task='caption',
shot_prompt_tmpl='<image>Output:{caption}<|endofchunk|>',
final_prompt_tmpl='<image>Output:',
generation_cfg=dict(num_beams=3, max_new_tokens=20, length_penalty=-2.0))
# data settings
data_preprocessor = dict(
mean=[122.770938, 116.7460125, 104.09373615],
std=[68.5005327, 66.6321579, 70.32316305],
to_rgb=True,
)
test_pipeline = [
dict(
type='ApplyToList',
# Flamingo requires loading multiple images during few-shot inference.
scatter_key='img_path',
transforms=[
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=224,
interpolation='bicubic',
backend='pillow'),
dict(type='CenterCrop', crop_size=(224, 224)),
],
collate_keys=['img', 'scale_factor', 'ori_shape'],
),
dict(
type='PackInputs',
algorithm_keys=['gt_caption', 'shots'],
meta_keys=['image_id']),
]
val_dataloader = dict(
batch_size=8,
num_workers=8,
dataset=dict(
type='FlamingoEvalCOCOCaption',
data_root='data/coco',
ann_file='annotations/captions_train2014.json',
data_prefix=dict(img_path='train2014'),
pipeline=test_pipeline,
num_shots=2,
num_support_examples=2048,
num_query_examples=5000,
),
sampler=dict(type='DefaultSampler', shuffle=False),
persistent_workers=True,
)
val_evaluator = dict(
type='COCOCaption',
ann_file='data/coco/annotations/captions_train2014.json')
# For a standard test, configure the test dataset manually.
test_dataloader = val_dataloader
test_evaluator = val_evaluator
# schedule settings
val_cfg = dict()
test_cfg = dict()
_base_ = [
'../_base_/default_runtime.py',
]
# model settings
model = dict(
type='Flamingo',
tokenizer=dict(
type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
vision_encoder=dict(
type='VisionTransformer',
arch='l',
patch_size=14,
pre_norm=True,
norm_cfg=dict(type='LN', eps=1e-5),
layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
final_norm=False,
out_type='raw',
pretrained=(
'https://download.openmmlab.com/mmclassification/v0/clip/'
'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
),
lang_encoder=dict(
base=dict(
type='AutoModelForCausalLM',
name_or_path='decapoda-research/llama-7b-hf',
local_files_only=True),
adapter=dict(
type='FlamingoLMAdapter',
vis_hidden_size=1024,
cross_attn_every_n_layers=4,
use_media_placement_augmentation=False),
),
task='vqa',
shot_prompt_tmpl=
'<image>Question:{question} Short Answer:{answer}<|endofchunk|>',
final_prompt_tmpl='<image>Question:{question} Short Answer:',
generation_cfg=dict(num_beams=3, max_new_tokens=5, length_penalty=-2.0))
# data settings
data_preprocessor = dict(
mean=[122.770938, 116.7460125, 104.09373615],
std=[68.5005327, 66.6321579, 70.32316305],
to_rgb=True,
)
test_pipeline = [
dict(
type='ApplyToList',
# Flamingo requires loading multiple images during few-shot inference.
scatter_key='img_path',
transforms=[
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=224,
interpolation='bicubic',
backend='pillow'),
dict(type='CenterCrop', crop_size=(224, 224)),
],
collate_keys=['img', 'scale_factor', 'ori_shape'],
),
dict(
type='PackInputs',
algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'],
meta_keys=['image_id']),
]
val_dataloader = dict(
batch_size=8,
num_workers=8,
dataset=dict(
type='FlamingoEvalCOCOVQA',
data_root='data/coco',
data_prefix='val2014',
question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json',
ann_file='annotations/v2_mscoco_val2014_annotations.json',
pipeline=test_pipeline,
num_shots=2,
num_support_examples=2048,
num_query_examples=5000,
),
sampler=dict(type='DefaultSampler', shuffle=False),
persistent_workers=True,
)
val_evaluator = dict(type='VQAAcc')
test_dataloader = dict(
batch_size=8,
num_workers=8,
dataset=dict(
type='FlamingoEvalCOCOVQA',
data_root='data/coco',
data_prefix='test2015',
question_file=
'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json',
pipeline=test_pipeline,
num_shots=0,
num_support_examples=2048,
num_query_examples=5000,
),
sampler=dict(type='DefaultSampler', shuffle=False),
persistent_workers=True,
)
test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json')
# schedule settings
val_cfg = dict()
test_cfg = dict()
_base_ = [
'../_base_/default_runtime.py',
]
zeroshot_prompt = (
'Output:A child holding a flowered umbrella and petting a yak.<|endofchunk|>' # noqa: E501
'Output:The child is holding a brush close to his mouth.<|endofchunk|>' # noqa: E501
)
# model settings
model = dict(
type='Flamingo',
tokenizer=dict(
type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
vision_encoder=dict(
type='VisionTransformer',
arch='l',
patch_size=14,
pre_norm=True,
norm_cfg=dict(type='LN', eps=1e-5),
layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
final_norm=False,
out_type='raw',
pretrained=(
'https://download.openmmlab.com/mmclassification/v0/clip/'
'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
),
lang_encoder=dict(
base=dict(
type='AutoModelForCausalLM',
name_or_path='decapoda-research/llama-7b-hf',
local_files_only=True),
adapter=dict(
type='FlamingoLMAdapter',
vis_hidden_size=1024,
cross_attn_every_n_layers=4,
use_media_placement_augmentation=False),
),
task='caption',
zeroshot_prompt=zeroshot_prompt,
final_prompt_tmpl='<image>Output:',
generation_cfg=dict(num_beams=3, max_new_tokens=20, length_penalty=-2.0),
)
# data settings
data_preprocessor = dict(
type='MultiModalDataPreprocessor',
mean=[122.770938, 116.7460125, 104.09373615],
std=[68.5005327, 66.6321579, 70.32316305],
to_rgb=True,
)
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=224,
interpolation='bicubic',
backend='pillow'),
dict(type='CenterCrop', crop_size=(224, 224)),
dict(
type='PackInputs',
algorithm_keys=['gt_caption'],
meta_keys=['image_id'],
),
]
val_dataloader = dict(
batch_size=8,
num_workers=8,
dataset=dict(
type='FlamingoEvalCOCOCaption',
data_root='data/coco',
ann_file='annotations/captions_train2014.json',
data_prefix=dict(img_path='train2014'),
pipeline=test_pipeline,
num_shots=0,
num_support_examples=2048,
num_query_examples=5000,
),
sampler=dict(type='DefaultSampler', shuffle=False),
persistent_workers=True,
)
val_evaluator = dict(
type='COCOCaption',
ann_file='data/coco/annotations/captions_train2014.json')
# For a standard test, configure the test dataset manually.
test_dataloader = val_dataloader
test_evaluator = val_evaluator
# schedule settings
val_cfg = dict()
test_cfg = dict()
_base_ = [
'../_base_/default_runtime.py',
]
zeroshot_prompt = (
'Question:What is this photo taken looking through? Short Answer:pitcher<|endofchunk|>' # noqa: E501
'Question:How many people are wearing shorts in the forefront of this photo? Short Answer:4<|endofchunk|>' # noqa: E501
)
# model settings
model = dict(
type='Flamingo',
tokenizer=dict(
type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
vision_encoder=dict(
type='VisionTransformer',
arch='l',
patch_size=14,
pre_norm=True,
norm_cfg=dict(type='LN', eps=1e-5),
layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
final_norm=False,
out_type='raw',
pretrained=(
'https://download.openmmlab.com/mmclassification/v0/clip/'
'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
),
lang_encoder=dict(
base=dict(
type='AutoModelForCausalLM',
name_or_path='decapoda-research/llama-7b-hf',
local_files_only=True),
adapter=dict(
type='FlamingoLMAdapter',
vis_hidden_size=1024,
cross_attn_every_n_layers=4,
use_media_placement_augmentation=False),
),
task='vqa',
zeroshot_prompt=zeroshot_prompt,
final_prompt_tmpl='<image>Question:{question} Short Answer:',
generation_cfg=dict(num_beams=3, max_new_tokens=5, length_penalty=-2.0))
# data settings
data_preprocessor = dict(
type='MultiModalDataPreprocessor',
mean=[122.770938, 116.7460125, 104.09373615],
std=[68.5005327, 66.6321579, 70.32316305],
to_rgb=True,
)
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=224,
interpolation='bicubic',
backend='pillow'),
dict(type='CenterCrop', crop_size=(224, 224)),
dict(
type='PackInputs',
algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'],
meta_keys=['image_id'],
),
]
val_dataloader = dict(
batch_size=8,
num_workers=8,
dataset=dict(
type='FlamingoEvalCOCOVQA',
data_root='data/coco',
data_prefix='val2014',
question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json',
ann_file='annotations/v2_mscoco_val2014_annotations.json',
pipeline=test_pipeline,
num_shots=0,
num_support_examples=2048,
num_query_examples=5000,
),
sampler=dict(type='DefaultSampler', shuffle=False),
persistent_workers=True,
)
val_evaluator = dict(type='VQAAcc')
test_dataloader = dict(
batch_size=8,
num_workers=8,
dataset=dict(
type='FlamingoEvalCOCOVQA',
data_root='data/coco',
data_prefix='test2015',
question_file=
'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json',
pipeline=test_pipeline,
num_shots=0,
num_support_examples=2048,
num_query_examples=5000,
),
sampler=dict(type='DefaultSampler', shuffle=False),
persistent_workers=True,
)
test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json')
# schedule settings
val_cfg = dict()
test_cfg = dict()
Collections:
- Name: Flamingo
Metadata:
Architecture:
- Transformer
- Gated Cross-Attention Dense
Paper:
Title: 'Flamingo: a Visual Language Model for Few-Shot Learning'
URL: https://arxiv.org/abs/2204.14198
README: configs/flamingo/README.md
Models:
- Name: flamingo_3rdparty-zeroshot_caption
Metadata:
FLOPs: null
Parameters: 8220452880
In Collection: Flamingo
Results:
- Task: Image Caption
Dataset: COCO
Metrics:
CIDER: 65.50 # Report from the official repo
Weights: https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth
Config: configs/flamingo/flamingo_zeroshot_caption.py
Converted From:
Weights: https://huggingface.co/openflamingo/OpenFlamingo-9B
Code: https://github.com/mlfoundations/open_flamingo
- Name: flamingo_3rdparty-zeroshot_vqa
Metadata:
FLOPs: null
Parameters: 8220452880
In Collection: Flamingo
Results:
- Task: Visual Question Answering
Dataset: VQAv2
Metrics:
Accuracy: 43.50 # Report from the official repo
Weights: https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth
Config: configs/flamingo/flamingo_zeroshot_vqa.py
Converted From:
Weights: https://huggingface.co/openflamingo/OpenFlamingo-9B
Code: https://github.com/mlfoundations/open_flamingo
# GLIP
> [Grounded Language-Image Pre-training](https://arxiv.org/abs/2112.03857)
<!-- [ALGORITHM] -->
## Abstract
This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head.
<div align="center">
<img src="https://github.com/microsoft/GLIP/blob/main/docs/lead.png" width="70%"/>
</div>
## How to use it?
<!-- [TABS-BEGIN] -->
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('swin-t_glip-pre_3rdparty', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
<!-- [TABS-END] -->
## Results and models
### Pre-trained models
The pre-trained models are intended for fine-tuning and therefore have no evaluation results.
| Model | Pretrain | Resolution | Download |
| :------------------------------------------ | :------------------------: | :--------: | :-------------------------------------------------------------------------------------------------------------------: |
| GLIP-T (`swin-t_glip-pre_3rdparty`)\* | O365,GoldG,CC3M,SBU | 224x224 | [model](https://download.openmmlab.com/mmclassification/v1/glip/swin-t_glip-pre_3rdparty_20230413-d85813b5.pth) |
| GLIP-L (`swin-l_glip-pre_3rdparty_384px`)\* | FourODs,GoldG,CC3M+12M,SBU | 384x384 | [model](https://download.openmmlab.com/mmclassification/v1/glip/swin-l_glip-pre_3rdparty_384px_20230413-04b198e8.pth) |
*Models with * are converted from the [official repo](https://github.com/microsoft/GLIP).*
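Because these weights come from a detection backbone, the classification configs expose the last three Swin stages (`out_indices=(1, 2, 3)`), so `extract_feat` returns a tuple of multi-scale feature maps. A minimal sketch; the printed shapes are an assumption based on the standard Swin-T stage widths (192/384/768 channels at strides 8/16/32):
```python
import torch
from mmpretrain import get_model

# GLIP-T backbone from the table above, used as a multi-scale feature extractor.
model = get_model('swin-t_glip-pre_3rdparty', pretrained=True)
feats = model.extract_feat(torch.rand(1, 3, 224, 224))
for feat in feats:
    print(feat.shape)
# Expected roughly: (1, 192, 28, 28), (1, 384, 14, 14), (1, 768, 7, 7)
```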
## Citation
```bibtex
@inproceedings{li2021grounded,
title={Grounded Language-Image Pre-training},
author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
year={2022},
booktitle={CVPR},
}
```
model = dict(
type='ImageClassifier',
backbone=dict(
type='SwinTransformer',
arch='large',
img_size=384,
out_indices=(1, 2, 3),  # the original weights are for detection
stage_cfgs=dict(block_cfgs=dict(window_size=12))),
neck=None,
head=None)
data_preprocessor = dict(
# BGR format normalization parameters (Caffe style)
mean=[103.53, 116.28, 123.675],
std=[57.375, 57.12, 58.395],
# keep images in BGR order
to_rgb=False,
)
model = dict(
type='ImageClassifier',
backbone=dict(
type='SwinTransformer',
arch='tiny',
img_size=224,
out_indices=(1, 2, 3),  # the original weights are for detection
),
neck=None,
head=None)
data_preprocessor = dict(
# BGR format normalization parameters (Caffe style)
mean=[103.53, 116.28, 123.675],
std=[57.375, 57.12, 58.395],
# keep images in BGR order
to_rgb=False,
)