Collections:
  - Name: ArcFace
    Metadata:
      Training Data: InShop
      Architecture:
        - Additive Angular Margin Loss
    Paper:
      URL: https://arxiv.org/abs/1801.07698
      Title: 'ArcFace: Additive Angular Margin Loss for Deep Face Recognition'
    README: configs/arcface/README.md
    Code:
      Version: v1.0.0rc3
      URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc3/mmcls/models/heads/margin_head.py

Models:
  - Name: resnet50-arcface_inshop
    Metadata:
      FLOPs: 16571226112
      Parameters: 31693888
    In Collection: ArcFace
    Results:
      - Dataset: InShop
        Metrics:
          Recall@1: 90.18
          mAP@10: 69.30
        Task: Image Retrieval
    Weights: https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth
    Config: configs/arcface/resnet50-arcface_8xb32_inshop.py
_base_ = [
    '../_base_/datasets/inshop_bs32_448.py',
    '../_base_/schedules/cub_bs64.py',
    '../_base_/default_runtime.py',
]

pretrained = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth'  # noqa

model = dict(
    type='ImageToImageRetriever',
    image_encoder=[
        dict(
            type='ResNet',
            depth=50,
            init_cfg=dict(
                type='Pretrained', checkpoint=pretrained, prefix='backbone')),
        dict(type='GlobalAveragePooling'),
    ],
    head=dict(
        type='ArcFaceClsHead',
        num_classes=3997,
        in_channels=2048,
        loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
        init_cfg=None),
    prototype={{_base_.gallery_dataloader}})

# runtime settings
default_hooks = dict(
    # log every 20 iterations
    logger=dict(type='LoggerHook', interval=20),
    # save the best and the last three checkpoints
    checkpoint=dict(
        type='CheckpointHook',
        save_best='auto',
        interval=1,
        max_keep_ckpts=3,
        rule='greater'))

# optimizer
optim_wrapper = dict(
    optimizer=dict(
        type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0005, nesterov=True))

# learning policy
param_scheduler = [
    # warm-up learning rate scheduler
    dict(
        type='LinearLR',
        start_factor=0.01,
        by_epoch=True,
        begin=0,
        end=5,
        # update by iteration
        convert_to_iter_based=True),
    # main learning rate scheduler
    dict(
        type='CosineAnnealingLR',
        T_max=45,
        by_epoch=True,
        begin=5,
        end=50,
    )
]

train_cfg = dict(by_epoch=True, max_epochs=50, val_interval=1)

auto_scale_lr = dict(enable=True, base_batch_size=256)

custom_hooks = [
    dict(type='PrepareProtoBeforeValLoopHook'),
    dict(type='SyncBuffersHook')
]
# BarlowTwins
> [Barlow Twins: Self-Supervised Learning via Redundancy Reduction](https://arxiv.org/abs/2103.03230)
<!-- [ALGORITHM] -->
## Abstract
Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection.
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/163914714-082de804-0b5f-4024-94f9-880e6ef334fa.png" width="800" />
</div>
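The objective described above reduces to a few lines of tensor algebra: standardize each view's embeddings over the batch, form the cross-correlation matrix between the two views, and push it toward the identity. The snippet below is a minimal, self-contained sketch of that idea; it is **not** the `CrossCorrelationLoss` implementation used by the configs in this folder, and the off-diagonal weight `lambd` is simply the paper's default.

```python
# Minimal sketch of the Barlow Twins loss (illustrative, not the mmpretrain code).
import torch


def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor,
                      lambd: float = 5e-3) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two distorted views of the same batch."""
    n, _ = z1.shape
    # standardize each embedding dimension over the batch
    z1 = (z1 - z1.mean(0)) / z1.std(0)
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    # empirical (D, D) cross-correlation matrix between the two views
    c = (z1.T @ z2) / n
    # push the diagonal to 1 and the off-diagonal (redundancy) to 0
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambd * off_diag


# usage with random embeddings
loss = barlow_twins_loss(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```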
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('barlowtwins_resnet50_8xb256-coslr-300e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py
```
Test:
```shell
python tools/test.py configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :-------------------------------------------- | :--------: | :-------: | :------------------------------------------------------: | :------------------------------------------------------------------------------: |
| `barlowtwins_resnet50_8xb256-coslr-300e_in1k` | 174.54 | 4.11 | [config](barlowtwins_resnet50_8xb256-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
| `resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k` | [BARLOWTWINS](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth) | 25.56 | 4.11 | 71.80 | [config](benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.json) |
## Citation
```bibtex
@inproceedings{zbontar2021barlow,
title={Barlow twins: Self-supervised learning via redundancy reduction},
author={Zbontar, Jure and Jing, Li and Misra, Ishan and LeCun, Yann and Deny, St{\'e}phane},
booktitle={International Conference on Machine Learning},
year={2021},
}
```
_base_ = [
'../_base_/datasets/imagenet_bs32_byol.py',
'../_base_/default_runtime.py',
]
# datasets
train_dataloader = dict(batch_size=256)
# model settings
model = dict(
type='BarlowTwins',
backbone=dict(
type='ResNet',
depth=50,
norm_cfg=dict(type='SyncBN'),
zero_init_residual=True),
neck=dict(
type='NonLinearNeck',
in_channels=2048,
hid_channels=8192,
out_channels=8192,
num_layers=3,
with_last_bn=False,
with_last_bn_affine=False,
with_avg_pool=True,
init_cfg=dict(
type='Kaiming', distribution='uniform', layer=['Linear'])),
head=dict(
type='LatentCrossCorrelationHead',
in_channels=8192,
loss=dict(type='CrossCorrelationLoss')))
# optimizer
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=1e-6),
paramwise_cfg=dict(
custom_keys={
'bn': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
'bias': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
# bn layer in ResNet block downsample module
'downsample.1': dict(
decay_mult=0, lr_mult=0.024, lars_exclude=True),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1.6e-4,
by_epoch=True,
begin=0,
end=10,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=990,
eta_min=0.0016,
by_epoch=True,
begin=10,
end=1000,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1000)
default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=2048)
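# (When enabled, the LR is scaled linearly by actual_batch_size / base_batch_size;
#  e.g. 8 GPUs x 256 images = 2048 keeps lr = 1.6, while a total batch size of
#  1024 would halve it.)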
_base_ = [
'../_base_/datasets/imagenet_bs32_byol.py',
'../_base_/default_runtime.py',
]
# datasets
train_dataloader = dict(batch_size=256)
# model settings
model = dict(
type='BarlowTwins',
backbone=dict(
type='ResNet',
depth=50,
norm_cfg=dict(type='SyncBN'),
zero_init_residual=True),
neck=dict(
type='NonLinearNeck',
in_channels=2048,
hid_channels=8192,
out_channels=8192,
num_layers=3,
with_last_bn=False,
with_last_bn_affine=False,
with_avg_pool=True,
init_cfg=dict(
type='Kaiming', distribution='uniform', layer=['Linear'])),
head=dict(
type='LatentCrossCorrelationHead',
in_channels=8192,
loss=dict(type='CrossCorrelationLoss')))
# optimizer
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=1e-6),
paramwise_cfg=dict(
custom_keys={
'bn': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
'bias': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
# bn layer in ResNet block downsample module
'downsample.1': dict(
decay_mult=0, lr_mult=0.024, lars_exclude=True),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1.6e-4,
by_epoch=True,
begin=0,
end=10,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=290,
eta_min=0.0016,
by_epoch=True,
begin=10,
end=300,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=2048)
_base_ = [
    '../../_base_/models/resnet50.py',
    '../../_base_/datasets/imagenet_bs32_pil_resize.py',
    '../../_base_/schedules/imagenet_sgd_coslr_100e.py',
    '../../_base_/default_runtime.py',
]

model = dict(
    backbone=dict(
        frozen_stages=4,
        init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))

# runtime settings
default_hooks = dict(
    checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
Collections:
  - Name: BarlowTwins
    Metadata:
      Training Data: ImageNet-1k
      Training Techniques:
        - LARS
      Training Resources: 8x A100 GPUs
      Architecture:
        - ResNet
        - BarlowTwins
    Paper:
      Title: 'Barlow Twins: Self-Supervised Learning via Redundancy Reduction'
      URL: https://arxiv.org/abs/2103.03230
    README: configs/barlowtwins/README.md

Models:
  - Name: barlowtwins_resnet50_8xb256-coslr-300e_in1k
    Metadata:
      Epochs: 300
      Batch Size: 2048
      FLOPs: 4109364224
      Parameters: 174535744
      Training Data: ImageNet-1k
    In Collection: BarlowTwins
    Results: null
    Weights: https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth
    Config: configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py
    Downstream:
      - resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k
  - Name: resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k
    Metadata:
      Epochs: 100
      Batch Size: 256
      FLOPs: 4109464576
      Parameters: 25557032
      Training Data: ImageNet-1k
    In Collection: BarlowTwins
    Results:
      - Task: Image Classification
        Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 71.8
    Weights: https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth
    Config: configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py
# BEiT
> [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254)
<!-- [ALGORITHM] -->
## Abstract
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/203688351-adac7146-4e71-4ab6-8958-5cfe643a2dc5.png" width="70%"/>
</div>
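In code, the pre-training recipe above is "tokenize, mask, predict": replace a random subset of patch embeddings with a mask token, run the backbone, and classify each masked position over the tokenizer's vocabulary. Below is a toy sketch of that objective with stand-in modules; the actual pipeline in this folder uses `BEiTPretrainViT`, `BEiTV1Head`, and a frozen DALL-E tokenizer as the target generator (see the pre-training config).

```python
# Toy sketch of the BEiT masked image modeling objective (stand-in modules only).
import torch
import torch.nn.functional as F

num_patches, embed_dim, vocab_size = 196, 768, 8192

# stand-ins for the real modules
encoder = torch.nn.Linear(embed_dim, embed_dim)      # backbone Transformer (stub)
to_logits = torch.nn.Linear(embed_dim, vocab_size)   # head predicting visual tokens

patch_embeddings = torch.randn(1, num_patches, embed_dim)
visual_tokens = torch.randint(0, vocab_size, (1, num_patches))  # from the frozen tokenizer

# randomly mask a subset of patches and replace them with a learnable [MASK] embedding
mask = torch.rand(1, num_patches) < 0.4
mask_token = torch.zeros(embed_dim, requires_grad=True)
corrupted = torch.where(mask.unsqueeze(-1), mask_token, patch_embeddings)

# predict the original visual token of every masked patch
logits = to_logits(encoder(corrupted))
loss = F.cross_entropy(logits[mask], visual_tokens[mask])
print(loss.item())
```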
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('beit-base-p16_beit-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('beit_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
```
Test:
```shell
python tools/test.py configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :---------------------------------------------- | :--------: | :-------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------: |
| `beit_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 86.53 | 17.58 | [config](beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: |
| `beit-base-p16_beit-pre_8xb128-coslr-100e_in1k` | [BEIT](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth) | 86.53 | 17.58 | 83.10 | N/A | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.json) |
| `beit-base-p16_beit-in21k-pre_3rdparty_in1k`\* | BEIT ImageNet-21k | 86.53 | 17.58 | 85.28 | 97.59 | [config](benchmarks/beit-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth) |
*Models with * are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@inproceedings{bao2022beit,
title={{BE}iT: {BERT} Pre-Training of Image Transformers},
author={Hangbo Bao and Li Dong and Songhao Piao and Furu Wei},
booktitle={International Conference on Learning Representations},
year={2022},
}
```
_base_ = '../_base_/default_runtime.py'
# dataset settings
dataset_type = 'ImageNet'
data_root = 'data/imagenet/'
data_preprocessor = dict(
type='TwoNormDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
second_mean=[-31.875, -31.875, -31.875],
second_std=[318.75, 318.75, 318.75],
to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ColorJitter',
brightness=0.4,
contrast=0.4,
saturation=0.4,
hue=0.),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandomResizedCropAndInterpolationWithTwoPic',
size=224,
second_size=112,
interpolation='bicubic',
second_interpolation='lanczos',
scale=(0.08, 1.0)),
dict(
type='BEiTMaskGenerator',
input_size=(14, 14),
num_masking_patches=75,
max_num_patches=None,
min_num_patches=16),
dict(type='PackInputs')
]
train_dataloader = dict(
batch_size=256,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
collate_fn=dict(type='default_collate'),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='meta/train.txt',
data_prefix=dict(img_path='train/'),
pipeline=train_pipeline))
# model settings
model = dict(
type='BEiT',
backbone=dict(
type='BEiTPretrainViT',
arch='base',
patch_size=16,
drop_path_rate=0.1,
final_norm=True,
out_type='raw',
layer_scale_init_value=0.1,
init_cfg=[
dict(type='TruncNormal', std=0.02, layer='Linear'),
dict(type='TruncNormal', std=0.02, layer='Conv2d'),
dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
]),
neck=None,
head=dict(
type='BEiTV1Head',
embed_dims=768,
num_embed=8192,
loss=dict(type='CrossEntropyLoss')),
target_generator=dict(
type='DALL-E',
init_cfg=dict(
type='Pretrained',
checkpoint= # noqa: E251
'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/dalle_encoder.pth', # noqa: E501
)))
# optimizer wrapper
optim_wrapper = dict(
type='AmpOptimWrapper',
loss_scale='dynamic',
optimizer=dict(
type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05),
clip_grad=dict(max_norm=3.0),
paramwise_cfg=dict(
custom_keys={
# the following configurations are designed for BEiT
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'q_bias': dict(decay_mult=0.0),
'v_bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0),
'.gamma': dict(decay_mult=0.0),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=10,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
eta_min=1e-5,
by_epoch=True,
begin=10,
end=300,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
default_hooks = dict(
# only keeps the latest 3 checkpoints
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
randomness = dict(seed=0, diff_rank_seed=True)
find_unused_parameters = True
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=2048)
_base_ = [
'../../_base_/datasets/imagenet_bs64_swin_224.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
data_preprocessor = dict(
num_classes=1000,
mean=[127.5, 127.5, 127.5],
std=[127.5, 127.5, 127.5],
to_rgb=True,
)
# model settings
model = dict(
type='ImageClassifier',
backbone=dict(
type='BEiTViT',
arch='base',
img_size=224,
patch_size=16,
drop_path_rate=0.1,
out_type='avg_featmap',
use_abs_pos_emb=False,
use_rel_pos_bias=True,
use_shared_rel_pos_bias=False,
init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='RandomResizedCrop',
scale=224,
backend='pillow',
interpolation='bicubic'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandAugment',
policies='timm_increasing',
num_policies=2,
total_level=10,
magnitude_level=9,
magnitude_std=0.5,
hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
dict(
type='RandomErasing',
erase_prob=0.25,
mode='rand',
min_area_ratio=0.02,
max_area_ratio=0.3333333333333333,
fill_color=[103.53, 116.28, 123.675],
fill_std=[57.375, 57.12, 58.395]),
dict(type='PackInputs')
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=256,
edge='short',
backend='pillow',
interpolation='bicubic'),
dict(type='CenterCrop', crop_size=224),
dict(type='PackInputs')
]
train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
# optimizer wrapper
optim_wrapper = dict(
optimizer=dict(
type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)),
constructor='LearningRateDecayOptimWrapperConstructor',
paramwise_cfg=dict(
_delete_=True,
layer_decay_rate=0.65,
custom_keys={
# the following configurations are designed for BEiT
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'q_bias': dict(decay_mult=0.0),
'v_bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0),
'.gamma': dict(decay_mult=0.0),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=20,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
by_epoch=True,
begin=20,
end=100,
eta_min=1e-6,
convert_to_iter_based=True)
]
# runtime settings
default_hooks = dict(
# save checkpoint per epoch.
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
train_cfg = dict(by_epoch=True, max_epochs=100)
randomness = dict(seed=0)
_base_ = [
'../../_base_/datasets/imagenet_bs64_swin_224.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
data_preprocessor = dict(
num_classes=1000,
# RGB format normalization parameters
mean=[127.5, 127.5, 127.5],
std=[127.5, 127.5, 127.5],
# convert image from BGR to RGB
to_rgb=True,
)
model = dict(
type='ImageClassifier',
backbone=dict(
type='BEiTViT',
arch='base',
img_size=224,
patch_size=16,
out_type='avg_featmap',
use_abs_pos_emb=False,
use_rel_pos_bias=True,
use_shared_rel_pos_bias=False,
),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=.02),
dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
],
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
Collections:
  - Name: BEiT
    Metadata:
      Architecture:
        - Attention Dropout
        - Convolution
        - Dense Connections
        - Dropout
        - GELU
        - Layer Normalization
        - Multi-Head Attention
        - Scaled Dot-Product Attention
        - Tanh Activation
    Paper:
      Title: 'BEiT: BERT Pre-Training of Image Transformers'
      URL: https://arxiv.org/abs/2106.08254
    README: configs/beit/README.md
    Code:
      URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/beit.py
      Version: v1.0.0rc4

Models:
  - Name: beit_beit-base-p16_8xb256-amp-coslr-300e_in1k
    Metadata:
      Epochs: 300
      Batch Size: 2048
      FLOPs: 17581219584
      Parameters: 86530984
      Training Data: ImageNet-1k
    In Collection: BEiT
    Results: null
    Weights: https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth
    Config: configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
    Downstream:
      - beit-base-p16_beit-pre_8xb128-coslr-100e_in1k
  - Name: beit-base-p16_beit-pre_8xb128-coslr-100e_in1k
    Metadata:
      Epochs: 100
      Batch Size: 1024
      FLOPs: 17581219584
      Parameters: 86530984
      Training Data: ImageNet-1k
    In Collection: BEiT
    Results:
      - Task: Image Classification
        Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 83.1
    Weights: https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth
    Config: configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
  - Name: beit-base-p16_beit-in21k-pre_3rdparty_in1k
    Metadata:
      FLOPs: 17581219584
      Parameters: 86530984
      Training Data:
        - ImageNet-21k
        - ImageNet-1k
    In Collection: BEiT
    Results:
      - Dataset: ImageNet-1k
        Task: Image Classification
        Metrics:
          Top 1 Accuracy: 85.28
          Top 5 Accuracy: 97.59
    Weights: https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth
    Config: configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py
    Converted From:
      Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beit/beit_base_patch16_224_pt22k_ft22kto1k.pth
      Code: https://github.com/microsoft/unilm/tree/master/beit
# BEiTv2
> [BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers](https://arxiv.org/abs/2208.06366)
<!-- [ALGORITHM] -->
## Abstract
Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this work, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Furthermore, we introduce a patch aggregation strategy which associates discrete image patches to enhance global semantic representation. Experiments on image classification and semantic segmentation show that BEiT v2 outperforms all compared MIM methods. On ImageNet-1K (224 size), the base-size BEiT v2 achieves 85.5% top-1 accuracy for fine-tuning and 80.1% top-1 accuracy for linear probing. The large-size BEiT v2 obtains 87.3% top-1 accuracy for ImageNet-1K (224 size) fine-tuning, and 56.7% mIoU on ADE20K for semantic segmentation.
<div align=center>
<img src="https://user-images.githubusercontent.com/36138628/203912182-5967a520-d455-49ea-bc67-dcbd500d76bf.png" width="70%"/>
</div>
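The tokenizer is the core difference from BEiT v1: prediction targets come from a vector-quantized codebook trained by knowledge distillation rather than from the DALL-E encoder. The toy snippet below only illustrates the quantization step, i.e. assigning each patch feature to its nearest code by cosine similarity; the trained tokenizer ships with mmpretrain as the `VQKD` target generator used in the configs below.

```python
# Toy illustration of the codebook lookup behind the VQ-KD tokenizer.
import torch
import torch.nn.functional as F

num_patches, feat_dim, codebook_size = 196, 32, 8192

codebook = torch.randn(codebook_size, feat_dim)       # learned code embeddings
patch_features = torch.randn(num_patches, feat_dim)   # semantic features of one image

# L2-normalize and pick the closest code for every patch (cosine similarity)
codes = F.normalize(codebook, dim=-1)
feats = F.normalize(patch_features, dim=-1)
token_ids = (feats @ codes.T).argmax(dim=-1)          # (num_patches,) discrete visual tokens
print(token_ids.shape, token_ids[:5])
```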
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model
predict = inference_model('beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
```
Test:
```shell
python tools/test.py configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth
```
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :------------------------------------------------ | :--------: | :-------: | :----------------------------------------------------------: | :----------------------------------------------------------------------: |
| `beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 192.81 | 17.58 | [config](beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: |
| `beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k` | [BEITV2](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth) | 86.53 | 17.58 | 85.00 | N/A | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.json) |
| `beit-base-p16_beitv2-in21k-pre_3rdparty_in1k`\* | BEITV2 ImageNet-21k | 86.53 | 17.58 | 86.47 | 97.99 | [config](benchmarks/beit-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth) |
*Models with * are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit2). The config files of these models are only for inference. We haven't reproduced the training results.*
## Citation
```bibtex
@article{beitv2,
title={{BEiT v2}: Masked Image Modeling with Vector-Quantized Visual Tokenizers},
author={Zhiliang Peng and Li Dong and Hangbo Bao and Qixiang Ye and Furu Wei},
year={2022},
eprint={2208.06366},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
_base_ = [
'../_base_/datasets/imagenet_bs256_beitv2.py',
'../_base_/default_runtime.py',
]
# model settings
vqkd_encoder = dict(
arch='base',
img_size=224,
patch_size=16,
in_channels=3,
out_indices=-1,
drop_rate=0.,
drop_path_rate=0.,
norm_cfg=dict(type='LN', eps=1e-6),
final_norm=True,
out_type='featmap',
with_cls_token=True,
frozen_stages=-1,
use_abs_pos_emb=True,
use_rel_pos_bias=False,
use_shared_rel_pos_bias=False,
layer_scale_init_value=0.,
interpolate_mode='bicubic',
patch_cfg=dict(),
layer_cfgs=dict(),
init_cfg=None)
layer_scale_init_value = 0.1
drop_path_rate = 0.1 # 0. for 300 epochs and 0.1 for 1600 epochs.
model = dict(
type='BEiT',
backbone=dict(
type='BEiTPretrainViT',
arch='base',
patch_size=16,
out_indices=[-4, -1],
drop_path_rate=drop_path_rate,
final_norm=False,
out_type='raw',
layer_scale_init_value=layer_scale_init_value,
init_cfg=[
dict(type='TruncNormal', std=0.02, layer='Linear'),
dict(type='TruncNormal', std=0.02, layer='Conv2d'),
dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
]),
neck=dict(
type='BEiTV2Neck',
num_layers=2,
early_layers=9,
backbone_arch='base',
drop_path_rate=drop_path_rate,
layer_scale_init_value=layer_scale_init_value,
),
head=dict(
type='BEiTV2Head',
embed_dims=768,
num_embed=8192,
loss=dict(type='CrossEntropyLoss')),
target_generator=dict(
type='VQKD',
encoder_config=vqkd_encoder,
init_cfg=dict(
type='Pretrained',
checkpoint= # noqa
'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/vqkd_encoder.pth' # noqa
)))
# optimizer wrapper
optim_wrapper = dict(
type='AmpOptimWrapper',
loss_scale='dynamic',
# betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs.
optimizer=dict(
type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05),
clip_grad=dict(max_norm=3.0),
paramwise_cfg=dict(
custom_keys={
# the following configurations are designed for BEiT
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'q_bias': dict(decay_mult=0.0),
'v_bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0),
'.gamma': dict(decay_mult=0.0),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=10,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
eta_min=1e-5,
by_epoch=True,
begin=10,
end=1600,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
default_hooks = dict(
# only keeps the latest 3 checkpoints
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
randomness = dict(seed=0, diff_rank_seed=True)
find_unused_parameters = True
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=2048)
_base_ = [
'../_base_/datasets/imagenet_bs256_beitv2.py',
'../_base_/default_runtime.py',
]
# model settings
vqkd_encoder = dict(
arch='base',
img_size=224,
patch_size=16,
in_channels=3,
out_indices=-1,
drop_rate=0.,
drop_path_rate=0.,
norm_cfg=dict(type='LN', eps=1e-6),
final_norm=True,
out_type='featmap',
with_cls_token=True,
frozen_stages=-1,
use_abs_pos_emb=True,
use_rel_pos_bias=False,
use_shared_rel_pos_bias=False,
layer_scale_init_value=0.,
interpolate_mode='bicubic',
patch_cfg=dict(),
layer_cfgs=dict(),
init_cfg=None)
layer_scale_init_value = 0.1
drop_path_rate = 0. # 0. for 300 epochs and 0.1 for 1600 epochs.
model = dict(
type='BEiT',
backbone=dict(
type='BEiTPretrainViT',
arch='base',
patch_size=16,
out_indices=[-4, -1],
drop_path_rate=drop_path_rate,
final_norm=False,
out_type='raw',
layer_scale_init_value=layer_scale_init_value,
init_cfg=[
dict(type='TruncNormal', std=0.02, layer='Linear'),
dict(type='TruncNormal', std=0.02, layer='Conv2d'),
dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
]),
neck=dict(
type='BEiTV2Neck',
num_layers=2,
early_layers=9,
backbone_arch='base',
drop_path_rate=drop_path_rate,
layer_scale_init_value=layer_scale_init_value,
),
head=dict(
type='BEiTV2Head',
embed_dims=768,
num_embed=8192,
loss=dict(type='CrossEntropyLoss')),
target_generator=dict(
type='VQKD',
encoder_config=vqkd_encoder,
init_cfg=dict(
type='Pretrained',
checkpoint= # noqa
'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/vqkd_encoder.pth' # noqa
)))
# optimizer wrapper
optim_wrapper = dict(
type='AmpOptimWrapper',
loss_scale='dynamic',
# betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs.
optimizer=dict(
type='AdamW', lr=1.5e-3, betas=(0.9, 0.98), weight_decay=0.05),
clip_grad=dict(max_norm=3.0),
paramwise_cfg=dict(
custom_keys={
# the following configurations are designed for BEiT
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'q_bias': dict(decay_mult=0.0),
'v_bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0),
'.gamma': dict(decay_mult=0.0),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=10,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
eta_min=1e-5,
by_epoch=True,
begin=10,
end=300,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
default_hooks = dict(
# only keeps the latest 3 checkpoints
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
randomness = dict(seed=0, diff_rank_seed=True)
find_unused_parameters = True
# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=2048)
_base_ = [
'../../_base_/datasets/imagenet_bs64_swin_224.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
# model settings
model = dict(
type='ImageClassifier',
backbone=dict(
type='BEiTViT',
arch='base',
img_size=224,
patch_size=16,
# 0.2 for 1600 epochs pretrained models and 0.1 for 300 epochs.
drop_path_rate=0.1,
out_type='avg_featmap',
use_abs_pos_emb=False,
use_rel_pos_bias=True,
use_shared_rel_pos_bias=False,
init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='RandomResizedCrop',
scale=224,
backend='pillow',
interpolation='bicubic'),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(
type='RandAugment',
policies='timm_increasing',
num_policies=2,
total_level=10,
magnitude_level=9,
magnitude_std=0.5,
hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
dict(
type='RandomErasing',
erase_prob=0.25,
mode='rand',
min_area_ratio=0.02,
max_area_ratio=0.3333333333333333,
fill_color=[103.53, 116.28, 123.675],
fill_std=[57.375, 57.12, 58.395]),
dict(type='PackInputs')
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ResizeEdge',
scale=256,
edge='short',
backend='pillow',
interpolation='bicubic'),
dict(type='CenterCrop', crop_size=224),
dict(type='PackInputs')
]
train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
# optimizer wrapper
optim_wrapper = dict(
optimizer=dict(
type='AdamW', lr=5e-4, weight_decay=0.05, betas=(0.9, 0.999)),
constructor='LearningRateDecayOptimWrapperConstructor',
paramwise_cfg=dict(
_delete_=True,
# 0.6 for 1600 epochs pretrained models and 0.65 for 300 epochs
layer_decay_rate=0.65,
custom_keys={
# the following configurations are designed for BEiT
'.ln': dict(decay_mult=0.0),
'.bias': dict(decay_mult=0.0),
'q_bias': dict(decay_mult=0.0),
'v_bias': dict(decay_mult=0.0),
'.cls_token': dict(decay_mult=0.0),
'.pos_embed': dict(decay_mult=0.0),
'.gamma': dict(decay_mult=0.0),
}))
# learning rate scheduler
param_scheduler = [
dict(
type='LinearLR',
start_factor=1e-4,
by_epoch=True,
begin=0,
end=20,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
by_epoch=True,
begin=20,
end=100,
eta_min=1e-6,
convert_to_iter_based=True)
]
# runtime settings
default_hooks = dict(
# save checkpoint per epoch.
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
train_cfg = dict(by_epoch=True, max_epochs=100)
randomness = dict(seed=0)
_base_ = [
'../../_base_/datasets/imagenet_bs64_swin_224.py',
'../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
'../../_base_/default_runtime.py'
]
model = dict(
type='ImageClassifier',
backbone=dict(
type='BEiTViT',
arch='base',
img_size=224,
patch_size=16,
out_type='avg_featmap',
use_abs_pos_emb=False,
use_rel_pos_bias=True,
use_shared_rel_pos_bias=False,
),
neck=None,
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=768,
loss=dict(
type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
),
init_cfg=[
dict(type='TruncNormal', layer='Linear', std=.02),
dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
],
train_cfg=dict(augments=[
dict(type='Mixup', alpha=0.8),
dict(type='CutMix', alpha=1.0)
]))
Collections:
  - Name: BEiTv2
    Metadata:
      Architecture:
        - Attention Dropout
        - Convolution
        - Dense Connections
        - Dropout
        - GELU
        - Layer Normalization
        - Multi-Head Attention
        - Scaled Dot-Product Attention
        - Tanh Activation
    Paper:
      Title: 'BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers'
      URL: https://arxiv.org/abs/2208.06366
    README: configs/beitv2/README.md
    Code:
      URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/beit.py
      Version: v1.0.0rc4

Models:
  - Name: beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k
    Metadata:
      Epochs: 300
      Batch Size: 2048
      FLOPs: 17581223424
      Parameters: 192811376
      Training Data: ImageNet-1k
    In Collection: BEiTv2
    Results: null
    Weights: https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth
    Config: configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
    Downstream:
      - beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k
  - Name: beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k
    Metadata:
      Epochs: 100
      Batch Size: 1024
      FLOPs: 17581219584
      Parameters: 86530984
      Training Data: ImageNet-1k
    In Collection: BEiTv2
    Results:
      - Task: Image Classification
        Dataset: ImageNet-1k
        Metrics:
          Top 1 Accuracy: 85.0
    Weights: https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth
    Config: configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
  - Name: beit-base-p16_beitv2-in21k-pre_3rdparty_in1k
    Metadata:
      FLOPs: 17581219584
      Parameters: 86530984
      Training Data:
        - ImageNet-21k
        - ImageNet-1k
    In Collection: BEiTv2
    Results:
      - Dataset: ImageNet-1k
        Task: Image Classification
        Metrics:
          Top 1 Accuracy: 86.47
          Top 5 Accuracy: 97.99
    Weights: https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth
    Config: configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py
    Converted From:
      Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beitv2/beitv2_base_patch16_224_pt1k_ft21kto1k.pth
      Code: https://github.com/microsoft/unilm/tree/master/beit2
# BLIP
> [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086)
<!-- [ALGORITHM] -->
## Abstract
Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/236374275-94d2f94b-d9a7-4f12-b694-f15a2be00be6.png" width="90%"/>
</div>
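The CapFilt bootstrapping described above is conceptually simple: a finetuned captioner proposes a synthetic caption for every web image, and an image-text matching filter keeps only the pairs it scores as matched. The sketch below outlines that loop with hypothetical `captioner` and `filter_model` callables; it is not an mmpretrain API, just an illustration of the data-bootstrapping idea.

```python
# Schematic sketch of CapFilt bootstrapping (hypothetical stand-in callables).
def bootstrap(web_pairs, captioner, filter_model, threshold=0.5):
    """web_pairs: iterable of (image, noisy_web_caption) pairs."""
    cleaned = []
    for image, web_caption in web_pairs:
        synthetic_caption = captioner(image)            # generate a synthetic caption
        for text in (web_caption, synthetic_caption):
            if filter_model(image, text) >= threshold:  # keep only matched image-text pairs
                cleaned.append((image, text))
    return cleaned
```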
## How to use it?
<!-- [TABS-BEGIN] -->
**Use the model**
```python
from mmpretrain import inference_model
result = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png')
print(result)
# {'pred_caption': 'a puppy and a cat sitting on a blanket'}
```
**Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Test:
```shell
python tools/test.py configs/blip/blip-base_8xb32_caption.py https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth
```
<!-- [TABS-END] -->
## Models and results
### Image Caption on COCO
| Model | Params (M) | BLEU-4 | CIDEr | Config | Download |
| :----------------------------- | :--------: | :----: | :----: | :------------------------------------: | :------------------------------------------------------------------------------------------------------------: |
| `blip-base_3rdparty_caption`\* | 223.97 | 40.12 | 132.82 | [config](./blip-base_8xb32_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth) |
### Image Caption on NoCaps
| Model | Params (M) | SPICE | CIDEr | Config | Download |
| :----------------------------- | :--------: | :---: | :----: | :-----------------------------------: | :--------------------------------------------------------------------------------------------------------------: |
| `blip-base_3rdparty_caption`\* | 223.97 | 14.69 | 109.12 | [config](./blip-base_8xb32_nocaps.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth) |
### Image Caption on Flickr30k
| Model | Params (M) | SPICE | CIDEr | Config | Download |
| :----------------------------- | :--------: | :---: | :---: | :----------------------------------------------: | :----------------------------------------------------------------------------------------------------: |
| `blip-base_3rdparty_caption`\* | 223.97 | 15.58 | 68.89 | [config](./blip-base_8xb32_caption_flickr30k.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth) |
### Visual Grounding on RefCOCO
| Model | Params (M) | Accuracy (testA) | Accuracy (testB) | Config | Download |
| :------------------------ | :--------: | :--------------: | :--------------: | :----------------------------------: | :-----------------------------------------------------------------------------------------------: |
| `blip-base_8xb16_refcoco` | 498.49 | 86.14 | 77.33 | [config](blip-base_8xb16_refcoco.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_8xb16_refcoco_20230508-d2d10f4c.pth) \| [log](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_8xb16_refcoco_20230508-d2d10f4c.json) |
### Visual Question Answering on VQAv2
| Model | Params (M) | Accuracy | Config | Download |
| :------------------------- | :--------: | :------: | :--------------------------------: | :-------------------------------------------------------------------------------------------------------------------: |
| `blip-base_3rdparty_vqa`\* | 361.48 | 78.20 | [config](./blip-base_8xb32_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth) |
### Visual Question Answering on OK-VQA
| Model | Params (M) | Accuracy | Config | Download |
| :------------------------- | :--------: | :------: | :----------------------------------: | :-------------------------------------------------------------------------------------------------------------------: |
| `blip-base_3rdparty_vqa`\* | 361.48 | 40.59# | [config](./blip-base_8xb32_okvqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth) |
### Visual Question Answering on OCR-VQA
| Model | Params (M) | Accuracy | Config | Download |
| :------------------------- | :--------: | :------: | :-----------------------------------: | :-------------------------------------------------------------------------------------------------------------------: |
| `blip-base_3rdparty_vqa`\* | 361.48 | 28.30# | [config](./blip-base_8xb32_ocrvqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth) |
### Image-To-Text Retrieval on COCO
| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
| :------------------------------- | :--------: | :------: | :------: | :--------------------------------------: | :----------------------------------------------------------------------------------------------------: |
| `blip-base_3rdparty_retrieval`\* | 447.49 | 82.52 | 95.34 | [config](./blip-base_8xb32_retrieval.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) |
### Text-To-Image Retrieval on COCO
| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
| :------------------------------- | :--------: | :------: | :------: | :--------------------------------------: | :----------------------------------------------------------------------------------------------------: |
| `blip-base_3rdparty_retrieval`\* | 447.49 | 64.82 | 86.28 | [config](./blip-base_8xb32_retrieval.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) |
### Image-To-Text Retrieval on Flickr30k
| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
| :------------------------------- | :--------: | :------: | :------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: |
| `blip-base_3rdparty_retrieval`\* | 447.49 | 95.10# | 99.60# | [config](./blip-base_8xb32_retrieval_flickr30k.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) |
### Text-To-Image Retrieval on Flickr30k
| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
| :------------------------------- | :--------: | :------: | :------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: |
| `blip-base_3rdparty_retrieval`\* | 447.49 | 85.26# | 96.58# | [config](./blip-base_8xb32_retrieval_flickr30k.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) |
### NLVR on NLVR2
| Model | Params (M) | Top-1 (%) | Config | Download |
| :-------------------------- | :--------: | :-------: | :---------------------------------: | :------------------------------------------------------------------------------------------------------------: |
| `blip-base_3rdparty_nlvr`\* | 259.37 | 82.33 | [config](./blip-base_8xb32_nlvr.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_nlvr_20230427-3b14d33f.pth) |
*Models with * are converted from the [official repo](https://github.com/salesforce/LAVIS). The config files of these models are only for inference. We haven't reproduced the training results.*
*Results with # denote zero-shot evaluation. The corresponding model hasn't been finetuned on that dataset.*
## Citation
```bibtex
@inproceedings{li2022blip,
title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
year={2022},
booktitle={ICML},
}
```
_base_ = [
    '../_base_/datasets/refcoco.py',
    '../_base_/default_runtime.py',
]

med_config = {
    'architectures': ['BertModel'],
    'attention_probs_dropout_prob': 0.1,
    'hidden_act': 'gelu',
    'hidden_dropout_prob': 0.1,
    'hidden_size': 768,
    'initializer_range': 0.02,
    'intermediate_size': 3072,
    'layer_norm_eps': 1e-12,
    'max_position_embeddings': 512,
    'model_type': 'bert',
    'num_attention_heads': 12,
    'num_hidden_layers': 12,
    'pad_token_id': 0,
    'add_type_embeddings': False,
    'vocab_size': 30524,
    'encoder_width': 768,
    'add_cross_attention': True
}

model = dict(
    type='BlipGrounding',
    visual_encoder=dict(
        type='VisionTransformer',
        arch='b',
        img_size=384,
        patch_size=16,
        out_type='raw',
    ),
    text_encoder=dict(
        type='XBertEncoder',
        med_config=med_config,
    ),
    multimodal_encoder=dict(
        type='XBertEncoder',
        med_config=med_config,
    ),
    tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
    head=dict(
        type='GroundingHead',
        decoder=dict(
            type='XBertLMHeadDecoder',
            med_config=med_config,
        ),
        box_l1_loss_coeff=4.0,
        box_giou_loss_coeff=2.0,
    ),
)

# schedule settings
optimizer = dict(type='AdamW', lr=1.5e-5, weight_decay=0.02)
optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)

param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]

train_cfg = dict(by_epoch=True, max_epochs=120)
val_cfg = dict()
test_cfg = dict()