Commit a245fbd1 authored by chenych

.

parent c501623c
# Fine-tuning BEiT-3 on Image Captioning
## COCO Captioning Setup
1. [Setup environment](../README.md#setup).
2. Download [2014 train images](http://images.cocodataset.org/zips/train2014.zip), [2014 val images](http://images.cocodataset.org/zips/val2014.zip) and [karpathy split](https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip), then organize the dataset in the following structure:
```
/path/to/your_data/
  train2014/
    COCO_train2014_000000000009.jpg
    ...
  val2014/
    COCO_val2014_000000000042.jpg
    ...
  dataset_coco.json
```
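Before generating the index, it can help to confirm that `dataset_coco.json` was unpacked correctly. The sketch below counts images per split. It assumes the Karpathy file has a top-level `images` list whose entries carry a `split` field; the helper name `count_karpathy_splits` is hypothetical and not part of this repository.

```python
import json
from collections import Counter
from pathlib import Path


def count_karpathy_splits(json_path):
    """Return a Counter mapping split name -> number of images.

    Assumes the file looks like {"images": [{"split": "train", ...}, ...]}.
    """
    with Path(json_path).open("r", encoding="utf-8") as f:
        data = json.load(f)
    return Counter(item["split"] for item in data["images"])


if __name__ == "__main__":
    path = Path("/path/to/your_data/dataset_coco.json")
    if path.exists():
        for split, n in sorted(count_karpathy_splits(path).items()):
            print(f"{split}: {n}")
```

If a split you expect is missing from the output, the archive was probably extracted incompletely or the wrong JSON file was copied.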
We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
```
from datasets import CaptioningDataset
from transformers import XLMRobertaTokenizer
tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
CaptioningDataset.make_coco_captioning_dataset_index(
    data_path="/path/to/your_data",
    tokenizer=tokenizer,
)
```
## NoCaps Setup
1. [Setup environment](../README.md#setup).
2. Download [NoCaps val set](https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json) and [NoCaps test set](https://s3.amazonaws.com/nocaps/nocaps_test_image_info.json), download the images using the URLs in the val and test JSON files, then organize the dataset in the following structure:
```
/path/to/your_data/
  val/
    09c863d76bcf6b00.jpg
    ...
  test/
    19dc6913830a0a21.jpg
    ...
  nocaps_val_4500_captions.json
  nocaps_test_image_info.json
```
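Step 2 leaves the image download itself to the reader. A minimal sketch of that step follows. It assumes each JSON file has a top-level `images` list whose entries provide a `coco_url` and a `file_name`; these field names are an assumption here, so check them against your copy of the files. The helper names `list_downloads` and `download_all` are hypothetical.

```python
import json
import urllib.request
from pathlib import Path


def list_downloads(json_path, out_dir):
    """Return (url, destination) pairs for every image listed in json_path."""
    with Path(json_path).open("r", encoding="utf-8") as f:
        images = json.load(f)["images"]
    out_dir = Path(out_dir)
    # Assumed field names: "coco_url" (source) and "file_name" (target name).
    return [(img["coco_url"], out_dir / img["file_name"]) for img in images]


def download_all(pairs):
    """Fetch each URL unless the destination file already exists."""
    for url, dest in pairs:
        if dest.exists():
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))


if __name__ == "__main__":
    root = Path("/path/to/your_data")
    val_json = root / "nocaps_val_4500_captions.json"
    if val_json.exists():
        download_all(list_downloads(val_json, root / "val"))
```

Skipping files that already exist makes the script safe to re-run after an interrupted download.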
We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
```
from datasets import CaptioningDataset
from transformers import XLMRobertaTokenizer
tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
CaptioningDataset.make_nocaps_captioning_dataset_index(
    data_path="/path/to/your_data",
)
```
We use the COCO captioning training set as the training data for NoCaps.
## Example: Fine-tuning BEiT-3 on Captioning
The BEiT-3 **base** model can be fine-tuned on captioning tasks using 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_base_patch16_480 \
--input_size 480 \
--task coco_captioning \
--batch_size 32 \
--layer_decay 1.0 \
--lr 4e-5 \
--randaug \
--epochs 10 \
--warmup_epochs 1 \
--drop_path 0.1 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_base_patch16_224.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_model \
--log_dir /path/to/save/your_model/log \
--weight_decay 0.05 \
--seed 42 \
--save_ckpt_freq 5 \
--num_max_bpe_tokens 32 \
--captioning_mask_prob 0.7 \
--drop_worst_after 12000 \
--dist_eval \
--checkpoint_activations \
--enable_deepspeed
```
- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*32 = 256`.
- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models).
- `--task`: **coco_captioning** for COCO captioning and **nocaps** for NoCaps dataset.
- `--lr`: 4e-5 for COCO captioning and 1e-5 for NoCaps.
- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
- `--checkpoint_activations`: using gradient checkpointing for saving GPU memory.
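The effective batch size rule in the list above can be written out as a one-line helper. This is only an illustrative sketch: the function name and the `nodes` parameter are hypothetical, and it assumes `--update_freq` is 1 when the flag is not passed, which matches the stated result `8*32 = 256`.

```python
def effective_batch_size(batch_size, gpus_per_node, nodes=1, update_freq=1):
    """Effective batch size = number of GPUs * per-GPU batch size * update_freq."""
    return gpus_per_node * nodes * batch_size * update_freq


# Single node, 8 GPUs, --batch_size 32 (the command above).
print(effective_batch_size(batch_size=32, gpus_per_node=8))  # 256

# Halving the per-GPU batch and doubling update_freq keeps the product unchanged.
print(effective_batch_size(batch_size=16, gpus_per_node=8, update_freq=2))  # 256
```

Keeping the product constant is a common way to fit the same recipe on GPUs with less memory, although results may still differ slightly from the reference setup.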
The BEiT-3 **large** model can be fine-tuned on captioning tasks using 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_large_patch16_480 \
--input_size 480 \
--task coco_captioning \
--batch_size 32 \
--layer_decay 1.0 \
--lr 8e-6 \
--randaug \
--epochs 10 \
--warmup_epochs 1 \
--drop_path 0.1 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_large_patch16_224.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_model \
--log_dir /path/to/save/your_model/log \
--weight_decay 0.05 \
--seed 42 \
--save_ckpt_freq 5 \
--num_max_bpe_tokens 32 \
--captioning_mask_prob 0.7 \
--drop_worst_after 12000 \
--dist_eval \
--checkpoint_activations \
--enable_deepspeed
```
- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*32 = 256`.
- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models).
- `--task`: **coco_captioning** for COCO captioning and **nocaps** for NoCaps dataset.
- `--lr`: 8e-6 for COCO captioning and NoCaps.
- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
- `--checkpoint_activations`: using gradient checkpointing for saving GPU memory.
## Example: Evaluate BEiT-3 Fine-tuned model on Captioning
- Get the prediction file of the fine-tuned BEiT3-base model on captioning with 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_base_patch16_480 \
--input_size 480 \
--task coco_captioning \
--batch_size 16 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_base_patch16_480_coco_captioning.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_prediction \
--eval \
--dist_eval
```
- `--task`: **coco_captioning** for COCO captioning and **nocaps** for NoCaps dataset.
- `--finetune`: **beit3_base_patch16_480_coco_captioning.pth** for COCO captioning and **beit3_base_patch16_480_nocaps.pth** for NoCaps dataset.
- Get the prediction file of the fine-tuned BEiT3-large model on captioning with 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_large_patch16_480 \
--input_size 480 \
--task coco_captioning \
--batch_size 16 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_large_patch16_480_coco_captioning.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_prediction \
--eval \
--dist_eval
```
- `--task`: **coco_captioning** for COCO captioning and **nocaps** for NoCaps dataset.
- `--finetune`: **beit3_large_patch16_480_coco_captioning.pth** for COCO captioning and **beit3_large_patch16_480_nocaps.pth** for NoCaps dataset.
Please then submit the prediction file in the `output_dir` to the [evaluation server](https://eval.ai/web/challenges/challenge-page/355/overview) to obtain the NoCaps val and test results.
# Fine-tuning BEiT-3 on ImageNet-1k (Image Classification)
## Setup
1. [Setup environment](../README.md#setup).
2. Download and extract ImageNet-1k from http://image-net.org/.
The directory structure is the standard layout of torchvision's [`datasets.ImageFolder`](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder). The training and validation data are expected to be in the `train/` folder and `val/` folder, respectively:
```
/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
```
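A quick way to check this layout before indexing is to compare the class folders under `train/` and `val/`. The sketch below uses only the standard library; the helper names `class_names` and `compare_splits` are hypothetical and not part of this repository.

```python
from pathlib import Path


def class_names(split_dir):
    """Return the sorted names of the class sub-directories in split_dir."""
    return sorted(p.name for p in Path(split_dir).iterdir() if p.is_dir())


def compare_splits(root):
    """Return (only_in_train, only_in_val) class-name lists for root/train and root/val."""
    train = set(class_names(Path(root) / "train"))
    val = set(class_names(Path(root) / "val"))
    return sorted(train - val), sorted(val - train)


if __name__ == "__main__":
    root = Path("/path/to/imagenet")
    if root.exists():
        only_train, only_val = compare_splits(root)
        print("classes only in train:", only_train)
        print("classes only in val:", only_val)
```

Both lists should be empty for a correctly extracted dataset, since every class is expected to appear in both splits.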
We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
```
from datasets import ImageNetDataset
ImageNetDataset.make_dataset_index(
    train_data_path="/path/to/your_data/train",
    val_data_path="/path/to/your_data/val",
    index_path="/path/to/your_data",
)
```
## Example: Fine-tuning BEiT-3 on ImageNet-1k (Image Classification)
The BEiT-3 **base** model can be finetuned on ImageNet-1k using 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_base_patch16_224 \
--task imagenet \
--batch_size 128 \
--layer_decay 0.65 \
--lr 7e-4 \
--update_freq 1 \
--epochs 50 \
--warmup_epochs 5 \
--drop_path 0.15 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_base_patch16_224.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_model \
--log_dir /path/to/save/your_model/log \
--weight_decay 0.05 \
--seed 42 \
--save_ckpt_freq 5 \
--dist_eval \
--mixup 0.8 \
--cutmix 1.0 \
--enable_deepspeed
```
- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*128*1 = 1024`.
- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models)
- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
The BEiT-3 **large** model can be finetuned on ImageNet-1k using a DGX box (8 V100-32GB):
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_large_patch16_224 \
--task imagenet \
--batch_size 128 \
--layer_decay 0.8 \
--lr 2e-4 \
--update_freq 1 \
--epochs 50 \
--warmup_epochs 5 \
--drop_path 0.25 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_large_patch16_224.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_model \
--log_dir /path/to/save/your_model/log \
--weight_decay 0.05 \
--seed 42 \
--save_ckpt_freq 5 \
--dist_eval \
--mixup 0.8 \
--cutmix 1.0 \
--enable_deepspeed \
--checkpoint_activations
```
- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*128 = 1024`.
- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models)
- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
- `--checkpoint_activations`: using gradient checkpointing for saving GPU memory
## Example: Evaluate BEiT-3 Finetuned model on ImageNet-1k (Image Classification)
- Evaluate our fine-tuned BEiT3-base model on ImageNet val with a single GPU:
```bash
python -m torch.distributed.launch --nproc_per_node=1 run_beit3_finetuning.py \
--model beit3_base_patch16_224 \
--task imagenet \
--batch_size 128 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_base_patch16_224_in1k.pth \
--data_path /path/to/your_data \
--eval \
--dist_eval
```
Expected results:
```
* Acc@1 85.400 Acc@5 97.630
```
- Evaluate our fine-tuned BEiT3-large model on ImageNet val with a single GPU:
```bash
python -m torch.distributed.launch --nproc_per_node=1 run_beit3_finetuning.py \
--model beit3_large_patch16_224 \
--task imagenet \
--batch_size 128 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_large_patch16_224_in1k.pth \
--data_path /path/to/your_data \
--eval \
--dist_eval
```
Expected results:
```
* Acc@1 87.580 Acc@5 98.326
```
# Fine-tuning BEiT-3 on NLVR2 (Visual Reasoning)
## Setup
1. [Setup environment](../README.md#setup).
2. Clone the [repository](https://github.com/lil-lab/nlvr) and sign the [request form](https://goo.gl/forms/yS29stWnFWzrDBFH3) to download the images, then organize the dataset in the following structure:
```
/path/to/your_data/
  images/train/
    0/train-11670-0-img0.png
    ...
  dev/
    dev-269-0-img0.png
    ...
  test1/
    test1-261-0-img0.png
    ...
  nlvr/ (nlvr repo)
    nlvr/
    nlvr2/
```
We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
```
from datasets import NLVR2Dataset
from transformers import XLMRobertaTokenizer
tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
NLVR2Dataset.make_dataset_index(
    data_path="/path/to/your_data",
    tokenizer=tokenizer,
    nlvr_repo_path="/path/to/your_data/nlvr",
)
```
## Example: Fine-tuning BEiT-3 on NLVR2 (Visual Reasoning)
The BEiT-3 **base** model can be finetuned on NLVR2 using 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_base_patch16_224 \
--task nlvr2 \
--batch_size 32 \
--layer_decay 0.65 \
--lr 7e-4 \
--epochs 20 \
--warmup_epochs 5 \
--drop_path 0.2 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_base_patch16_224.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_model \
--log_dir /path/to/save/your_model/log \
--weight_decay 0.2 \
--seed 42 \
--save_ckpt_freq 5 \
--enable_deepspeed
```
- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*32 = 256`.
- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models).
- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
- `--lr`: 7e-4 for `BEiT3-base`, 5e-4 for `BEiT3-base-indomain`.
The BEiT-3 **large** model can be finetuned on NLVR2 using 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_large_patch16_224 \
--task nlvr2 \
--batch_size 32 \
--layer_decay 0.85 \
--lr 3e-4 \
--epochs 20 \
--warmup_epochs 5 \
--drop_path 0.2 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_large_patch16_224.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_model \
--log_dir /path/to/save/your_model/log \
--weight_decay 0.2 \
--seed 42 \
--save_ckpt_freq 5 \
--enable_deepspeed \
--checkpoint_activations
```
- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*32 = 256`.
- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models).
- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
- `--lr`: 3e-4 for `BEiT3-large`, 1e-4 for `BEiT3-large-indomain`.
- `--checkpoint_activations`: using gradient checkpointing for saving GPU memory.
## Example: Evaluate BEiT-3 Finetuned model on NLVR2 (Visual Reasoning)
- Get the result of our fine-tuned BEiT3-base model on NLVR2 test with 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_base_patch16_224 \
--task nlvr2 \
--batch_size 32 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_base_patch16_224_nlvr2.pth \
--data_path /path/to/your_data \
--eval \
--dist_eval
```
Expected results:
```
* Acc 84.386
```
- Get the result of our fine-tuned BEiT3-large model on NLVR2 test with 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_large_patch16_224 \
--task nlvr2 \
--batch_size 32 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_large_patch16_224_nlvr2.pth \
--data_path /path/to/your_data \
--eval \
--dist_eval
```
Expected results:
```
* Acc 89.437
```
# Fine-tuning BEiT-3 on Image-text Retrieval
## COCO Retrieval Setup
1. [Setup environment](../README.md#setup).
2. Download [2014 train images](http://images.cocodataset.org/zips/train2014.zip), [2014 val images](http://images.cocodataset.org/zips/val2014.zip) and [karpathy split](https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip), then organize the dataset in the following structure:
```
/path/to/your_data/
  train2014/
    COCO_train2014_000000000009.jpg
    ...
  val2014/
    COCO_val2014_000000000042.jpg
    ...
  dataset_coco.json
```
We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
```
from datasets import RetrievalDataset
from transformers import XLMRobertaTokenizer
tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
RetrievalDataset.make_coco_dataset_index(
    data_path="/path/to/your_data",
    tokenizer=tokenizer,
)
```
## Flickr30k Retrieval Setup
1. [Setup environment](../README.md#setup).
2. Sign the [flickr images request form](https://forms.illinois.edu/sec/229675) and download the [karpathy split](https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip), then organize the dataset in the following structure:
```
/path/to/your_data/
  flickr30k-images/
    2923475135.jpg
    ...
  dataset_flickr30k.json
```
We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
```
from datasets import RetrievalDataset
from transformers import XLMRobertaTokenizer
tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
RetrievalDataset.make_flickr30k_dataset_index(
    data_path="/path/to/your_data",
    tokenizer=tokenizer,
    karpathy_path="/path/to/your_data",
)
```
## Example: Fine-tuning BEiT-3 on Retrieval
The BEiT-3 **base** model can be finetuned on retrieval tasks using 16 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=16 run_beit3_finetuning.py \
--model beit3_base_patch16_384 \
--input_size 384 \
--task coco_retrieval \
--batch_size 192 \
--layer_decay 0.65 \
--lr 2e-4 \
--epochs 15 \
--warmup_epochs 3 \
--drop_path 0.2 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_base_itc_patch16_224.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_model \
--log_dir /path/to/save/your_model/log \
--weight_decay 0.05 \
--seed 42 \
--save_ckpt_freq 5 \
--enable_deepspeed \
--checkpoint_activations
```
- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `192*16 = 3072`.
- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models)
- `--task`: **coco_retrieval** for COCO retrieval, **flickr30k** for Flickr30k retrieval
- `--lr`: 2e-4 for COCO retrieval, 1e-4 for Flickr30k retrieval
- `--epochs`: 15 for COCO retrieval, 20 for Flickr30k retrieval
- `--warmup_epochs`: 3 for COCO retrieval, 5 for Flickr30k retrieval
- `--checkpoint_activations`: using gradient checkpointing for saving GPU memory
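The per-task values in the list above can be kept in one place so that switching between COCO and Flickr30k retrieval does not require editing the command by hand. The sketch below is illustrative only: `RETRIEVAL_SETTINGS` and `retrieval_args` are hypothetical names, and the numbers are copied from the base-model list above.

```python
# Values copied from the list above (base model); the dictionary layout is illustrative.
RETRIEVAL_SETTINGS = {
    "coco_retrieval": {"lr": "2e-4", "epochs": 15, "warmup_epochs": 3},
    "flickr30k": {"lr": "1e-4", "epochs": 20, "warmup_epochs": 5},
}


def retrieval_args(task):
    """Return the task-specific command-line arguments as a flat list of strings."""
    s = RETRIEVAL_SETTINGS[task]
    return [
        "--task", task,
        "--lr", s["lr"],
        "--epochs", str(s["epochs"]),
        "--warmup_epochs", str(s["warmup_epochs"]),
    ]


print(" ".join(retrieval_args("flickr30k")))
# --task flickr30k --lr 1e-4 --epochs 20 --warmup_epochs 5
```

The returned list can be appended to the remaining arguments of the launch command, for example when starting the run through `subprocess.run`.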
The BEiT-3 **large** model can be finetuned on retrieval tasks using 2x16 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=16 --nnodes=2 --node_rank=$NODE_RANK \
--master_addr=$MASTER_ADDR --master_port=$MASTER_PORT run_beit3_finetuning.py \
--model beit3_large_patch16_384 \
--input_size 384 \
--task coco_retrieval \
--batch_size 96 \
--layer_decay 0.85 \
--lr 5e-5 \
--epochs 15 \
--warmup_epochs 3 \
--drop_path 0.2 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_large_itc_patch16_224.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_model \
--log_dir /path/to/save/your_model/log \
--weight_decay 0.05 \
--seed 42 \
--save_ckpt_freq 5 \
--enable_deepspeed \
--checkpoint_activations
```
- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `96*32 = 3072`.
- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models)
- `--task`: **coco_retrieval** for COCO retrieval, **flickr30k** for Flickr30k retrieval
- `--epochs`: 15 for COCO retrieval, 20 for Flickr30k retrieval
- `--warmup_epochs`: 3 for COCO retrieval, 5 for Flickr30k retrieval
- `--checkpoint_activations`: using gradient checkpointing for saving GPU memory
## Example: Evaluate BEiT-3 Fine-tuned model on COCO Retrieval and Flickr30k Retrieval
- Get the results of our fine-tuned BEiT3-base model on retrieval tasks using a single GPU:
```bash
python -m torch.distributed.launch --nproc_per_node=1 run_beit3_finetuning.py \
--model beit3_base_patch16_384 \
--input_size 384 \
--task coco_retrieval \
--batch_size 16 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_base_patch16_384_coco_retrieval.pth \
--data_path /path/to/your_data \
--eval \
--dist_eval
```
- `--task`: **coco_retrieval** for COCO retrieval, **flickr30k** for Flickr30k retrieval
- `--finetune`: **beit3_base_patch16_384_coco_retrieval.pth** for COCO retrieval, **beit3_base_patch16_384_f30k_retrieval.pth** for Flickr30k retrieval
- Get the results of our fine-tuned BEiT3-large model on retrieval tasks using a single GPU:
```bash
python -m torch.distributed.launch --nproc_per_node=1 run_beit3_finetuning.py \
--model beit3_large_patch16_384 \
--input_size 384 \
--task coco_retrieval \
--batch_size 16 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_large_patch16_384_coco_retrieval.pth \
--data_path /path/to/your_data \
--eval \
--dist_eval
```
- `--task`: **coco_retrieval** for COCO retrieval, **flickr30k** for Flickr30k retrieval
- `--finetune`: **beit3_large_patch16_384_coco_retrieval.pth** for COCO retrieval, **beit3_large_patch16_384_f30k_retrieval.pth** for Flickr30k retrieval
# Fine-tuning BEiT-3 on VQAv2 (Visual Question Answering)
## Setup
1. [Setup environment](../README.md#setup).
2. Download COCO [2014 train images](http://images.cocodataset.org/zips/train2014.zip), [2014 val images](http://images.cocodataset.org/zips/val2014.zip), [2015 test images](http://images.cocodataset.org/zips/test2015.zip), annotations ([train](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip), [val](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip)), and questions ([train](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip), [val](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip), [test](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Test_mscoco.zip)), then organize the dataset in the following structure:
```
/path/to/your_data/
  train2014/
    COCO_train2014_000000000009.jpg
    ...
  val2014/
    COCO_val2014_000000000042.jpg
    ...
  test2015/
    COCO_test2015_000000000001.jpg
    ...
  vqa/
    v2_OpenEnded_mscoco_train2014_questions.json
    v2_OpenEnded_mscoco_val2014_questions.json
    v2_OpenEnded_mscoco_test2015_questions.json
    v2_OpenEnded_mscoco_test-dev2015_questions.json
    v2_mscoco_train2014_annotations.json
    v2_mscoco_val2014_annotations.json
```
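Because this layout combines three image folders with six JSON files, a missing file is easy to overlook. The sketch below lists whatever is absent. It assumes the JSON files sit inside `vqa/`, and the helper name `missing_vqa_entries` is hypothetical.

```python
from pathlib import Path

# Directory and file names taken from the layout above.
EXPECTED_IMAGE_DIRS = ["train2014", "val2014", "test2015"]
EXPECTED_VQA_FILES = [
    "v2_OpenEnded_mscoco_train2014_questions.json",
    "v2_OpenEnded_mscoco_val2014_questions.json",
    "v2_OpenEnded_mscoco_test2015_questions.json",
    "v2_OpenEnded_mscoco_test-dev2015_questions.json",
    "v2_mscoco_train2014_annotations.json",
    "v2_mscoco_val2014_annotations.json",
]


def missing_vqa_entries(root):
    """Return the expected directories and files that do not exist under root."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_IMAGE_DIRS if not (root / d).is_dir()]
    missing += [
        "vqa/" + name
        for name in EXPECTED_VQA_FILES
        if not (root / "vqa" / name).is_file()
    ]
    return missing


if __name__ == "__main__":
    for entry in missing_vqa_entries("/path/to/your_data"):
        print("missing:", entry)
```

An empty result means the directory matches the layout; it does not verify that the archives were extracted completely.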
We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
```
from datasets import VQAv2Dataset
from transformers import XLMRobertaTokenizer
tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
VQAv2Dataset.make_dataset_index(
    data_path="/path/to/your_data",
    tokenizer=tokenizer,
    annotation_data_path="/path/to/your_data/vqa",
)
```
## Example: Fine-tuning BEiT-3 on VQAv2 (Visual Question Answering)
The BEiT-3 **base** model can be finetuned on VQAv2 using 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_base_patch16_480 \
--input_size 480 \
--task vqav2 \
--batch_size 16 \
--layer_decay 1.0 \
--lr 3e-5 \
--update_freq 1 \
--randaug \
--epochs 10 \
--warmup_epochs 1 \
--drop_path 0.1 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_base_patch16_224.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_model \
--log_dir /path/to/save/your_model/log \
--weight_decay 0.01 \
--seed 42 \
--save_ckpt_freq 5 \
--task_head_lr_weight 20 \
--opt_betas 0.9 0.98 \
--enable_deepspeed
```
- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*16 = 128`.
- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models)
- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
The BEiT-3 **large** model can be finetuned on VQAv2 using 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_large_patch16_480 \
--input_size 480 \
--task vqav2 \
--batch_size 16 \
--layer_decay 1.0 \
--lr 2e-5 \
--update_freq 1 \
--randaug \
--epochs 10 \
--warmup_epochs 1 \
--drop_path 0.15 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_large_patch16_224.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_model \
--log_dir /path/to/save/your_model/log \
--weight_decay 0.01 \
--seed 42 \
--save_ckpt_freq 5 \
--task_head_lr_weight 20 \
--opt_betas 0.9 0.98 \
--enable_deepspeed \
--checkpoint_activations
```
- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*16 = 128`.
- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models)
- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
- `--checkpoint_activations`: using gradient checkpointing for saving GPU memory
## Example: Evaluate BEiT-3 Finetuned model on VQAv2 (Visual Question Answering)
- Get the prediction file of the fine-tuned BEiT3-base model on VQAv2 test with 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_base_patch16_480 \
--input_size 480 \
--task vqav2 \
--batch_size 16 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_base_patch16_480_vqa.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_prediction \
--eval \
--dist_eval
```
- Get the prediction file of the fine-tuned BEiT3-large model on VQAv2 test with 8 V100-32GB:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
--model beit3_large_patch16_480 \
--input_size 480 \
--task vqav2 \
--batch_size 16 \
--sentencepiece_model /your_beit3_model_path/beit3.spm \
--finetune /your_beit3_model_path/beit3_large_patch16_480_vqa.pth \
--data_path /path/to/your_data \
--output_dir /path/to/save/your_prediction \
--eval \
--dist_eval
```
Please then submit the prediction file in the `output_dir` to the [evaluation server](https://eval.ai/web/challenges/challenge-page/830/overview) to obtain the VQAv2 test-dev and test-std results.
import re
contractions = {
"aint": "ain't",
"arent": "aren't",
"cant": "can't",
"couldve": "could've",
"couldnt": "couldn't",
"couldn'tve": "couldn't've",
"couldnt've": "couldn't've",
"didnt": "didn't",
"doesnt": "doesn't",
"dont": "don't",
"hadnt": "hadn't",
"hadnt've": "hadn't've",
"hadn'tve": "hadn't've",
"hasnt": "hasn't",
"havent": "haven't",
"hed": "he'd",
"hed've": "he'd've",
"he'dve": "he'd've",
"hes": "he's",
"howd": "how'd",
"howll": "how'll",
"hows": "how's",
"Id've": "I'd've",
"I'dve": "I'd've",
"Im": "I'm",
"Ive": "I've",
"isnt": "isn't",
"itd": "it'd",
"itd've": "it'd've",
"it'dve": "it'd've",
"itll": "it'll",
"let's": "let's",
"maam": "ma'am",
"mightnt": "mightn't",
"mightnt've": "mightn't've",
"mightn'tve": "mightn't've",
"mightve": "might've",
"mustnt": "mustn't",
"mustve": "must've",
"neednt": "needn't",
"notve": "not've",
"oclock": "o'clock",
"oughtnt": "oughtn't",
"ow's'at": "'ow's'at",
"'ows'at": "'ow's'at",
"'ow'sat": "'ow's'at",
"shant": "shan't",
"shed've": "she'd've",
"she'dve": "she'd've",
"she's": "she's",
"shouldve": "should've",
"shouldnt": "shouldn't",
"shouldnt've": "shouldn't've",
"shouldn'tve": "shouldn't've",
"somebody'd": "somebodyd",
"somebodyd've": "somebody'd've",
"somebody'dve": "somebody'd've",
"somebodyll": "somebody'll",
"somebodys": "somebody's",
"someoned": "someone'd",
"someoned've": "someone'd've",
"someone'dve": "someone'd've",
"someonell": "someone'll",
"someones": "someone's",
"somethingd": "something'd",
"somethingd've": "something'd've",
"something'dve": "something'd've",
"somethingll": "something'll",
"thats": "that's",
"thered": "there'd",
"thered've": "there'd've",
"there'dve": "there'd've",
"therere": "there're",
"theres": "there's",
"theyd": "they'd",
"theyd've": "they'd've",
"they'dve": "they'd've",
"theyll": "they'll",
"theyre": "they're",
"theyve": "they've",
"twas": "'twas",
"wasnt": "wasn't",
"wed've": "we'd've",
"we'dve": "we'd've",
"weve": "we've",
"werent": "weren't",
"whatll": "what'll",
"whatre": "what're",
"whats": "what's",
"whatve": "what've",
"whens": "when's",
"whered": "where'd",
"wheres": "where's",
"whereve": "where've",
"whod": "who'd",
"whod've": "who'd've",
"who'dve": "who'd've",
"wholl": "who'll",
"whos": "who's",
"whove": "who've",
"whyll": "why'll",
"whyre": "why're",
"whys": "why's",
"wont": "won't",
"wouldve": "would've",
"wouldnt": "wouldn't",
"wouldnt've": "wouldn't've",
"wouldn'tve": "wouldn't've",
"yall": "y'all",
"yall'll": "y'all'll",
"y'allll": "y'all'll",
"yall'd've": "y'all'd've",
"y'alld've": "y'all'd've",
"y'all'dve": "y'all'd've",
"youd": "you'd",
"youd've": "you'd've",
"you'dve": "you'd've",
"youll": "you'll",
"youre": "you're",
"youve": "you've",
}
manual_map = {
"none": "0",
"zero": "0",
"one": "1",
"two": "2",
"three": "3",
"four": "4",
"five": "5",
"six": "6",
"seven": "7",
"eight": "8",
"nine": "9",
"ten": "10",
}
articles = ["a", "an", "the"]
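# NOTE: "(?!<=\d)" is a negative lookahead for the literal text "<=" plus a digit,
# not a lookbehind ("(?<!\d)"), so a period is stripped unless a digit follows it.
period_strip = re.compile(r"(?!<=\d)(\.)(?!\d)")
comma_strip = re.compile(r"(\d)(\,)(\d)")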
punct = [
";",
r"/",
"[",
"]",
'"',
"{",
"}",
"(",
")",
"=",
"+",
"\\",
"_",
"-",
">",
"<",
"@",
"`",
",",
"?",
"!",
]


def normalize_word(token):
    _token = token
    for p in punct:
        if (p + " " in token or " " + p in token) or (
            re.search(comma_strip, token) is not None
        ):
            _token = _token.replace(p, "")
        else:
            _token = _token.replace(p, " ")
    # re.UNICODE lands in the positional `count` slot of Pattern.sub here;
    # it is left unchanged to preserve the original behavior.
    token = period_strip.sub("", _token, re.UNICODE)

    _token = []
    temp = token.lower().split()
    for word in temp:
        # .get() returns the same value as .setdefault() without mutating manual_map.
        word = manual_map.get(word, word)
        if word not in articles:
            _token.append(word)
    for i, word in enumerate(_token):
        if word in contractions:
            _token[i] = contractions[word]
    token = " ".join(_token)
    token = token.replace(",", "")
    return token
# --------------------------------------------------------
# Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks (https://arxiv.org/abs/2208.10442)
# Github source: https://github.com/microsoft/unilm/tree/master/beit3
# Copyright (c) 2023 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
import torch
import torch.nn as nn
import torch.nn.functional as F
from timm.models.registry import register_model
import numpy as np
import utils
from modeling_utils import BEiT3Wrapper, _get_base_config, _get_large_config
class TwoLayerMLP(nn.Module):
def __init__(
self,
in_features,
hidden_features,
out_features,
norm_layer,
norm_input=True,
):
super().__init__()
self.norm1 = norm_layer(in_features) if norm_input else nn.Identity()
self.dense1 = nn.Linear(in_features, hidden_features)
self.norm2 = norm_layer(hidden_features)
self.act = nn.GELU()
self.dense2 = nn.Linear(hidden_features, out_features)
def forward(self, x):
x = self.norm1(x)
x = self.dense1(x)
x = self.norm2(x)
x = self.act(x)
return self.dense2(x)
class Pooler(nn.Module):
def __init__(self, input_features, output_features, norm_layer):
super().__init__()
self.norm = norm_layer(input_features)
self.dense = nn.Linear(input_features, output_features)
self.activation = nn.Tanh()
def forward(self, x):
cls_rep = x[:, 0, :]
cls_rep = self.norm(cls_rep)
pooled_output = self.dense(cls_rep)
pooled_output = self.activation(pooled_output)
return pooled_output
class BEiT3ForVisualReasoning(BEiT3Wrapper):
def __init__(
self,
args,
num_classes,
norm_layer=nn.LayerNorm,
**kwargs
):
super(BEiT3ForVisualReasoning, self).__init__(args=args)
embed_dim = args.encoder_embed_dim
self.head = TwoLayerMLP(
in_features=embed_dim * 4,
hidden_features=embed_dim * 2,
out_features=num_classes,
norm_layer=norm_layer,
)
init_scale = 0.001
self.head.apply(self._init_weights)
if isinstance(self.head.dense1, nn.Linear):
self.head.dense1.weight.data.mul_(init_scale)
self.head.dense1.bias.data.mul_(init_scale)
if isinstance(self.head.dense2, nn.Linear):
self.head.dense2.weight.data.mul_(init_scale)
self.head.dense2.bias.data.mul_(init_scale)
def forward(self, image_a, image_b, text_description, padding_mask, **kwargs):
bsz, _ = text_description.size()
vision_input = torch.cat((image_a, image_b), dim=0)
language_input = torch.cat((text_description, text_description), dim=0)
padding_mask = torch.cat((padding_mask, padding_mask), dim=0)
outputs = self.beit3(
textual_tokens=language_input,
visual_tokens=vision_input,
text_padding_position=padding_mask,
)
x = outputs["encoder_out"]
multiway_split_position = outputs["multiway_split_position"]
vision_cls = x[:, 0, :]
language_cls = x[:, multiway_split_position, :]
cls_rep = torch.cat((vision_cls, language_cls), dim=-1)
a, b = torch.split(cls_rep, split_size_or_sections=[bsz, bsz], dim=0)
cls_rep = torch.cat((a, b), dim=-1)
return self.head(cls_rep)
class BEiT3ForImageClassification(BEiT3Wrapper):
def __init__(
self,
args,
num_classes,
norm_layer=nn.LayerNorm,
**kwargs
):
super(BEiT3ForImageClassification, self).__init__(args=args)
embed_dim = args.encoder_embed_dim
self.fc_norm = norm_layer(embed_dim)
self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
self.fc_norm.apply(self._init_weights)
self.head.apply(self._init_weights)
init_scale = 0.001
if isinstance(self.head, nn.Linear):
self.head.weight.data.mul_(init_scale)
self.head.bias.data.mul_(init_scale)
def forward(self, image, **kwargs):
x = self.beit3(textual_tokens=None, visual_tokens=image)["encoder_out"]
t = x[:, 1:, :]
cls_x = self.fc_norm(t.mean(1))
return self.head(cls_x)
class BEiT3ForCaptioning(BEiT3Wrapper):
def __init__(
self,
args,
**kwargs
):
super(BEiT3ForCaptioning, self).__init__(args=args)
embed_dim = args.encoder_embed_dim
self.mlm_head = nn.Linear(embed_dim, args.vocab_size)
self.mlm_head.apply(self._init_weights)
def forward(self, image, text_ids, padding_mask, language_masked_pos, text_len=None, incremental_state=None, **kwargs):
text_len = text_len if text_len is not None else text_ids.size(1)
image_len = self.beit3.vision_embed.num_position_embeddings()
max_len = text_len + image_len
uni_mask = torch.zeros((max_len, max_len), dtype=torch.long, device=text_ids.device)
i_start, i_end = 0, image_len
t_start, t_end = image_len, max_len
# triangle mask for caption to caption
uni_mask[t_start:t_end, t_start:t_end] = torch.tril(torch.ones(text_len, text_len, dtype=torch.long, device=text_ids.device))
# full attention for caption to image
uni_mask[t_start:t_end, i_start:i_end] = 1
# full attention for image to image
uni_mask[i_start:i_end, i_start:i_end] = 1
uni_mask = 1-uni_mask
if incremental_state is not None:
for idx in range(self.get_num_layers()):
if idx not in incremental_state:
incremental_state[idx] = {}
# for incremental decoding
positions = None
if image is None:
uni_mask = uni_mask[-2:]
padding_mask = None
# start position (2 (fairseq starts at 2) + cur_position) is equal to text_len
positions = torch.arange(text_len, text_ids.size(1) + text_len, device=text_ids.device).long().unsqueeze(0)
outputs = self.beit3(
textual_tokens=text_ids,
visual_tokens=image,
text_padding_position=padding_mask,
attn_mask=uni_mask,
incremental_state=incremental_state,
positions=positions,
)
if image is not None:
text_feats = outputs["encoder_out"][:, image_len:]
else:
text_feats = outputs["encoder_out"]
if language_masked_pos is not None:
text_feats = text_feats[language_masked_pos.bool()]
return self.mlm_head(text_feats), incremental_state
class BEiT3ForVisualQuestionAnswering(BEiT3Wrapper):
def __init__(
self,
args,
num_classes,
norm_layer=nn.LayerNorm,
**kwargs
):
super(BEiT3ForVisualQuestionAnswering, self).__init__(args=args)
embed_dim = args.encoder_embed_dim
self.pooler = Pooler(
input_features=embed_dim,
output_features=embed_dim,
norm_layer=norm_layer,
)
self.pooler.apply(self._init_weights)
self.head = nn.Sequential(
nn.Linear(embed_dim, embed_dim * 2),
norm_layer(embed_dim * 2),
nn.GELU(),
nn.Linear(embed_dim * 2, num_classes),
)
self.head.apply(self._init_weights)
def forward(self, image, question, padding_mask, **kwargs):
outputs = self.beit3(
textual_tokens=question,
visual_tokens=image,
text_padding_position=padding_mask,
)
x = outputs["encoder_out"]
cls_rep = self.pooler(x)
return self.head(cls_rep)
class BEiT3ForRetrieval(BEiT3Wrapper):
def __init__(
self,
args,
**kwargs
):
super(BEiT3ForRetrieval, self).__init__(args=args)
embed_dim = args.encoder_embed_dim
self.language_head = nn.Linear(embed_dim, embed_dim, bias=False)
self.vision_head = nn.Linear(embed_dim, embed_dim, bias=False)
self.language_head.apply(self._init_weights)
self.vision_head.apply(self._init_weights)
self.criterion = utils.ClipLoss(
rank=utils.get_rank(),
world_size=utils.get_world_size(),
)
self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
def forward(self, image=None, text_description=None, padding_mask=None, only_infer=False, **kwargs):
if image is not None:
outputs = self.beit3(
textual_tokens=None,
visual_tokens=image,
text_padding_position=None,
)
x = outputs["encoder_out"]
vision_cls = self.vision_head(x[:, 0, :])
vision_cls = F.normalize(vision_cls, dim=-1)
else:
vision_cls = None
if text_description is not None:
outputs = self.beit3(
textual_tokens=text_description,
visual_tokens=None,
text_padding_position=padding_mask,
)
x = outputs["encoder_out"]
language_cls = self.language_head(x[:, 0, :])
language_cls = F.normalize(language_cls, dim=-1)
else:
language_cls = None
if only_infer:
return vision_cls, language_cls
else:
loss, logits_per_image, logits_per_text = self.criterion(
vision_cls, language_cls, self.logit_scale.exp())
return loss, vision_cls, language_cls
@register_model
def beit3_base_patch16_224_imageclassification(pretrained=False, **kwargs):
args = _get_base_config(**kwargs)
args.normalize_output = False
model = BEiT3ForImageClassification(args, num_classes=1000, **kwargs)
return model
@register_model
def beit3_large_patch16_224_imageclassification(pretrained=False, **kwargs):
args = _get_large_config(**kwargs)
args.normalize_output = False
model = BEiT3ForImageClassification(args, num_classes=1000, **kwargs)
return model
@register_model
def beit3_base_patch16_224_nlvr2(pretrained=False, **kwargs):
args = _get_base_config(**kwargs)
model = BEiT3ForVisualReasoning(args, num_classes=2, **kwargs)
return model
@register_model
def beit3_large_patch16_224_nlvr2(pretrained=False, **kwargs):
args = _get_large_config(**kwargs)
model = BEiT3ForVisualReasoning(args, num_classes=2, **kwargs)
return model
@register_model
def beit3_base_patch16_384_vqav2(pretrained=False, **kwargs):
args = _get_base_config(img_size=384, **kwargs)
args.normalize_output = False
model = BEiT3ForVisualQuestionAnswering(args, num_classes=3129, **kwargs)
return model
@register_model
def beit3_base_patch16_480_vqav2(pretrained=False, **kwargs):
args = _get_base_config(img_size=480, **kwargs)
args.normalize_output = False
model = BEiT3ForVisualQuestionAnswering(args, num_classes=3129, **kwargs)
return model
@register_model
def beit3_large_patch16_384_vqav2(pretrained=False, **kwargs):
args = _get_large_config(img_size=384, **kwargs)
args.normalize_output = False
model = BEiT3ForVisualQuestionAnswering(args, num_classes=3129, **kwargs)
return model
@register_model
def beit3_large_patch16_480_vqav2(pretrained=False, **kwargs):
args = _get_large_config(img_size=480, **kwargs)
args.normalize_output = False
model = BEiT3ForVisualQuestionAnswering(args, num_classes=3129, **kwargs)
return model
@register_model
def beit3_large_patch16_768_vqav2(pretrained=False, **kwargs):
args = _get_large_config(img_size=768, **kwargs)
args.normalize_output = False
model = BEiT3ForVisualQuestionAnswering(args, num_classes=3129, **kwargs)
return model
@register_model
def beit3_base_patch16_224_captioning(pretrained=False, **kwargs):
args = _get_base_config(**kwargs)
model = BEiT3ForCaptioning(args, **kwargs)
return model
@register_model
def beit3_base_patch16_480_captioning(pretrained=False, **kwargs):
args = _get_base_config(img_size=480, **kwargs)
model = BEiT3ForCaptioning(args, **kwargs)
return model
@register_model
def beit3_large_patch16_480_captioning(pretrained=False, **kwargs):
args = _get_large_config(img_size=480, **kwargs)
model = BEiT3ForCaptioning(args, **kwargs)
return model
@register_model
def beit3_base_patch16_224_retrieval(pretrained=False, **kwargs):
args = _get_base_config(**kwargs)
model = BEiT3ForRetrieval(args, **kwargs)
return model
@register_model
def beit3_base_patch16_384_retrieval(pretrained=False, **kwargs):
args = _get_base_config(img_size=384, **kwargs)
model = BEiT3ForRetrieval(args, **kwargs)
return model
@register_model
def beit3_large_patch16_384_retrieval(pretrained=False, **kwargs):
args = _get_large_config(img_size=384, **kwargs)
model = BEiT3ForRetrieval(args, **kwargs)
return model
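The attention mask built in `BEiT3ForCaptioning.forward` is easier to see on a toy example. The sketch below rebuilds the same layout with plain Python lists instead of tensors; `build_caption_mask` is a hypothetical name, and the sizes are chosen only for illustration. As in the model code, the final mask uses 1 for "blocked" and 0 for "may attend".

```python
def build_caption_mask(image_len, text_len):
    """Return an (L x L) mask, L = image_len + text_len; 1 = blocked, 0 = may attend."""
    total = image_len + text_len
    allowed = [[0] * total for _ in range(total)]
    for q in range(total):          # query position (row)
        for k in range(total):      # key position (column)
            if k < image_len:
                # image tokens are visible to every query (image and caption)
                allowed[q][k] = 1
            elif q >= image_len and k <= q:
                # a caption token sees itself and earlier caption tokens only
                allowed[q][k] = 1
    # invert, mirroring `uni_mask = 1 - uni_mask`
    return [[1 - v for v in row] for row in allowed]


for row in build_caption_mask(image_len=2, text_len=3):
    print(row)
# [0, 0, 1, 1, 1]
# [0, 0, 1, 1, 1]
# [0, 0, 0, 1, 1]
# [0, 0, 0, 0, 1]
# [0, 0, 0, 0, 0]
```

Image rows never attend to caption columns, so the image representation does not depend on the caption, while each caption token sees the full image plus its own left context.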
# --------------------------------------------------------
# Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks (https://arxiv.org/abs/2208.10442)
# Github source: https://github.com/microsoft/unilm/tree/master/beit3
# Copyright (c) 2023 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
import math
import torch
import torch.nn as nn
from timm.models.layers import trunc_normal_ as __call_trunc_normal_
from torchscale.model.BEiT3 import BEiT3
from torchscale.architecture.config import EncoderConfig
def trunc_normal_(tensor, mean=0., std=1.):
__call_trunc_normal_(tensor, mean=mean, std=std, a=-std, b=std)
def _get_base_config(
img_size=224, patch_size=16, drop_path_rate=0,
checkpoint_activations=None, mlp_ratio=4, vocab_size=64010, **kwargs
):
return EncoderConfig(
img_size=img_size, patch_size=patch_size, vocab_size=vocab_size, multiway=True,
layernorm_embedding=False, normalize_output=True, no_output_layer=True,
drop_path_rate=drop_path_rate, encoder_embed_dim=768, encoder_attention_heads=12,
encoder_ffn_embed_dim=int(768 * mlp_ratio), encoder_layers=12,
checkpoint_activations=checkpoint_activations,
)
def _get_large_config(
img_size=224, patch_size=16, drop_path_rate=0,
checkpoint_activations=None, mlp_ratio=4, vocab_size=64010, **kwargs
):
return EncoderConfig(
img_size=img_size, patch_size=patch_size, vocab_size=vocab_size, multiway=True,
layernorm_embedding=False, normalize_output=True, no_output_layer=True,
drop_path_rate=drop_path_rate, encoder_embed_dim=1024, encoder_attention_heads=16,
encoder_ffn_embed_dim=int(1024 * mlp_ratio), encoder_layers=24,
checkpoint_activations=checkpoint_activations,
)
class BEiT3Wrapper(nn.Module):
def __init__(self, args, **kwargs):
super().__init__()
self.args = args
self.beit3 = BEiT3(args)
self.apply(self._init_weights)
def fix_init_weight(self):
def rescale(param, layer_id):
param.div_(math.sqrt(2.0 * layer_id))
for layer_id, layer in enumerate(self.blocks):
rescale(layer.attn.proj.weight.data, layer_id + 1)
rescale(layer.mlp.fc2.weight.data, layer_id + 1)
def get_num_layers(self):
return self.beit3.encoder.num_layers
@torch.jit.ignore
def no_weight_decay(self):
return {'pos_embed', 'cls_token', 'beit3.encoder.embed_positions.A.weight', 'beit3.vision_embed.cls_token', 'logit_scale'}
def _init_weights(self, m):
if isinstance(m, nn.Linear):
trunc_normal_(m.weight, std=.02)
if isinstance(m, nn.Linear) and m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
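The sizes implied by `_get_base_config` and `_get_large_config` can be worked out by hand. The sketch below is a rough calculation, not part of the repository: `describe_config` is a hypothetical helper, and the "+ 1" assumes the vision embedding adds a single [CLS] token on top of the patch tokens.

```python
def describe_config(embed_dim, img_size=224, patch_size=16, mlp_ratio=4):
    """Derive the FFN width and visual token count for one encoder configuration."""
    patches_per_side = img_size // patch_size
    num_patches = patches_per_side ** 2
    return {
        "ffn_dim": int(embed_dim * mlp_ratio),
        "num_patches": num_patches,
        "visual_tokens": num_patches + 1,  # assumption: one extra [CLS] token
    }


print(describe_config(768))                 # base model at 224x224
print(describe_config(1024, img_size=480))  # large model at 480x480
```

Under these assumptions the base model at 224 pixels uses 196 patches (197 visual tokens), and a 480-pixel input uses 900 patches (901 visual tokens), which is the `image_len` offset used when slicing text features in the captioning head.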
# --------------------------------------------------------
# Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks (https://arxiv.org/abs/2208.10442)
# Github source: https://github.com/microsoft/unilm/tree/master/beit3
# Copyright (c) 2023 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from torch import optim as optim
from timm.optim.lookahead import Lookahead
import json
def get_num_layer_for_vit(var_name, num_max_layer):
if "embed" in var_name:
return 0
elif var_name in (
"cls_token", "mask_token", "pos_embed", "language_pos_embed",
"word_embeddings.weight", "vision_cls_token", "vision_pos_embed"
):
return 0
elif var_name.startswith("patch_embed"):
return 0
elif var_name.startswith("rel_pos_bias"):
return num_max_layer - 1
elif "layers." in var_name:
layer_id = int(var_name.split('layers.')[1].split('.')[0])
return layer_id + 1
else:
return num_max_layer - 1
def get_is_head_flag_for_vit(var_name, num_max_layer):
if var_name.startswith("head"):
return 1
# elif var_name.startswith("pooler"):
# return 1
else:
return 0
class LayerDecayValueAssigner(object):
def __init__(self, values, scale_handler=None):
self.scale_handler = scale_handler or get_num_layer_for_vit
self.values = values
def get_scale(self, layer_id):
return self.values[layer_id]
def get_layer_id(self, var_name):
return self.scale_handler(var_name, len(self.values))
# The implementation code is modified from timm (https://github.com/huggingface/pytorch-image-models/tree/main/timm)
def get_parameter_groups(model, weight_decay=1e-5, skip_list=(), get_num_layer=None, get_layer_scale=None):
parameter_group_names = {}
parameter_group_vars = {}
for name, param in model.named_parameters():
if not param.requires_grad:
continue # frozen weights
if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
group_name = "no_decay"
this_weight_decay = 0.
else:
group_name = "decay"
this_weight_decay = weight_decay
if get_num_layer is not None:
layer_id = get_num_layer(name)
group_name = "layer_%d_%s" % (layer_id, group_name)
else:
layer_id = None
if group_name not in parameter_group_names:
if get_layer_scale is not None:
scale = get_layer_scale(layer_id)
else:
scale = 1.
parameter_group_names[group_name] = {
"weight_decay": this_weight_decay,
"params": [],
"lr_scale": scale
}
parameter_group_vars[group_name] = {
"weight_decay": this_weight_decay,
"params": [],
"lr_scale": scale
}
parameter_group_vars[group_name]["params"].append(param)
parameter_group_names[group_name]["params"].append(name)
print("Param groups = %s" % json.dumps(parameter_group_names, indent=2))
return list(parameter_group_vars.values())
def create_optimizer(args, model, get_num_layer=None, get_layer_scale=None, filter_bias_and_bn=True, skip_list=None):
opt_lower = args.opt.lower()
weight_decay = args.weight_decay
if weight_decay and filter_bias_and_bn:
skip = {}
if skip_list is not None:
skip = skip_list
elif hasattr(model, 'no_weight_decay'):
skip = model.no_weight_decay()
parameters = get_parameter_groups(model, weight_decay, skip, get_num_layer, get_layer_scale)
weight_decay = 0.
else:
parameters = model.parameters()
opt_args = dict(lr=args.lr, weight_decay=weight_decay)
if hasattr(args, 'opt_eps') and args.opt_eps is not None:
opt_args['eps'] = args.opt_eps
if hasattr(args, 'opt_betas') and args.opt_betas is not None:
opt_args['betas'] = args.opt_betas
opt_split = opt_lower.split('_')
opt_lower = opt_split[-1]
if opt_lower == 'adamw':
optimizer = optim.AdamW(parameters, **opt_args)
else:
raise ValueError("Invalid optimizer")
if len(opt_split) > 1:
if opt_split[0] == 'lookahead':
optimizer = Lookahead(optimizer)
return optimizer
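Layer-wise learning-rate decay ties `get_num_layer_for_vit`, `LayerDecayValueAssigner` and the `lr_scale` field of each parameter group together. The sketch below shows one plausible way the scale list is built and looked up. The geometric formula in `layer_decay_scales` is an assumption (the training script is collapsed in this diff), `layer_id_for` is a simplified stand-in for `get_num_layer_for_vit`, and the parameter names are illustrative.

```python
def layer_decay_scales(layer_decay, num_layers):
    """Index 0 = embeddings, 1..num_layers = encoder layers, last = head and the rest."""
    return [layer_decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]


def layer_id_for(name, num_max_layer):
    """Simplified layer lookup: embeddings -> 0, 'layers.N' -> N + 1, everything else -> last."""
    if "embed" in name:
        return 0
    if "layers." in name:
        return int(name.split("layers.")[1].split(".")[0]) + 1
    return num_max_layer - 1


scales = layer_decay_scales(layer_decay=0.5, num_layers=2)
print(scales)  # [0.125, 0.25, 0.5, 1.0]

for name in ("beit3.text_embed.weight",
             "beit3.encoder.layers.1.self_attn.k_proj.weight",
             "head.weight"):
    print(name, scales[layer_id_for(name, len(scales))])
```

With `--layer_decay 1.0`, as in the fine-tuning script, every entry is 1.0 and all groups share the base learning rate.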
import cv2
import numpy as np
## aug functions
def identity_func(img):
return img
def autocontrast_func(img, cutoff=0):
'''
same output as PIL.ImageOps.autocontrast
'''
n_bins = 256
def tune_channel(ch):
n = ch.size
cut = cutoff * n // 100
if cut == 0:
high, low = ch.max(), ch.min()
else:
hist = cv2.calcHist([ch], [0], None, [n_bins], [0, n_bins])
low = np.argwhere(np.cumsum(hist) > cut)
low = 0 if low.shape[0] == 0 else low[0]
high = np.argwhere(np.cumsum(hist[::-1]) > cut)
high = n_bins - 1 if high.shape[0] == 0 else n_bins - 1 - high[0]
if high <= low:
table = np.arange(n_bins)
else:
scale = (n_bins - 1) / (high - low)
offset = -low * scale
table = np.arange(n_bins) * scale + offset
table[table < 0] = 0
table[table > n_bins - 1] = n_bins - 1
table = table.clip(0, 255).astype(np.uint8)
return table[ch]
channels = [tune_channel(ch) for ch in cv2.split(img)]
out = cv2.merge(channels)
return out
def equalize_func(img):
'''
same output as PIL.ImageOps.equalize
PIL's implementation is different from cv2.equalize
'''
n_bins = 256
def tune_channel(ch):
hist = cv2.calcHist([ch], [0], None, [n_bins], [0, n_bins])
non_zero_hist = hist[hist != 0].reshape(-1)
step = np.sum(non_zero_hist[:-1]) // (n_bins - 1)
if step == 0: return ch
n = np.empty_like(hist)
n[0] = step // 2
n[1:] = hist[:-1]
table = (np.cumsum(n) // step).clip(0, 255).astype(np.uint8)
return table[ch]
channels = [tune_channel(ch) for ch in cv2.split(img)]
out = cv2.merge(channels)
return out
def rotate_func(img, degree, fill=(0, 0, 0)):
'''
like PIL, rotate by degree, not radians
'''
H, W = img.shape[0], img.shape[1]
center = W / 2, H / 2
M = cv2.getRotationMatrix2D(center, degree, 1)
out = cv2.warpAffine(img, M, (W, H), borderValue=fill)
return out
def solarize_func(img, thresh=128):
'''
same output as PIL.ImageOps.solarize
'''
table = np.array([el if el < thresh else 255 - el for el in range(256)])
table = table.clip(0, 255).astype(np.uint8)
out = table[img]
return out
def color_func(img, factor):
'''
same output as PIL.ImageEnhance.Color
'''
## implementation according to PIL definition, quite slow
# degenerate = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)[:, :, np.newaxis]
# out = blend(degenerate, img, factor)
# M = (
# np.eye(3) * factor
# + np.float32([0.114, 0.587, 0.299]).reshape(3, 1) * (1. - factor)
# )[np.newaxis, np.newaxis, :]
M = (
np.float32([
[0.886, -0.114, -0.114],
[-0.587, 0.413, -0.587],
[-0.299, -0.299, 0.701]]) * factor
+ np.float32([[0.114], [0.587], [0.299]])
)
out = np.matmul(img, M).clip(0, 255).astype(np.uint8)
return out
def contrast_func(img, factor):
"""
same output as PIL.ImageEnhance.Contrast
"""
mean = np.sum(np.mean(img, axis=(0, 1)) * np.array([0.114, 0.587, 0.299]))
table = np.array([(
el - mean) * factor + mean
for el in range(256)
]).clip(0, 255).astype(np.uint8)
out = table[img]
return out
def brightness_func(img, factor):
'''
same output as PIL.ImageEnhance.Brightness
'''
table = (np.arange(256, dtype=np.float32) * factor).clip(0, 255).astype(np.uint8)
out = table[img]
return out
def sharpness_func(img, factor):
'''
The differences between this result and PIL are all on the 4 boundaries; the center
areas are the same
'''
kernel = np.ones((3, 3), dtype=np.float32)
kernel[1][1] = 5
kernel /= 13
degenerate = cv2.filter2D(img, -1, kernel)
if factor == 0.0:
out = degenerate
elif factor == 1.0:
out = img
else:
out = img.astype(np.float32)
degenerate = degenerate.astype(np.float32)[1:-1, 1:-1, :]
out[1:-1, 1:-1, :] = degenerate + factor * (out[1:-1, 1:-1, :] - degenerate)
out = out.astype(np.uint8)
return out
def shear_x_func(img, factor, fill=(0, 0, 0)):
H, W = img.shape[0], img.shape[1]
M = np.float32([[1, factor, 0], [0, 1, 0]])
out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
return out
def translate_x_func(img, offset, fill=(0, 0, 0)):
'''
same output as PIL.Image.transform
'''
H, W = img.shape[0], img.shape[1]
M = np.float32([[1, 0, -offset], [0, 1, 0]])
out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
return out
def translate_y_func(img, offset, fill=(0, 0, 0)):
'''
same output as PIL.Image.transform
'''
H, W = img.shape[0], img.shape[1]
M = np.float32([[1, 0, 0], [0, 1, -offset]])
out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
return out
def posterize_func(img, bits):
'''
same output as PIL.ImageOps.posterize
'''
out = np.bitwise_and(img, np.uint8((255 << (8 - bits)) & 0xFF))
return out
def shear_y_func(img, factor, fill=(0, 0, 0)):
H, W = img.shape[0], img.shape[1]
M = np.float32([[1, 0, 0], [factor, 1, 0]])
out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
return out
def cutout_func(img, pad_size, replace=(0, 0, 0)):
replace = np.array(replace, dtype=np.uint8)
H, W = img.shape[0], img.shape[1]
rh, rw = np.random.random(2)
pad_size = pad_size // 2
ch, cw = int(rh * H), int(rw * W)
x1, x2 = max(ch - pad_size, 0), min(ch + pad_size, H)
y1, y2 = max(cw - pad_size, 0), min(cw + pad_size, W)
out = img.copy()
out[x1:x2, y1:y2, :] = replace
return out
### level to args
def enhance_level_to_args(MAX_LEVEL):
def level_to_args(level):
return ((level / MAX_LEVEL) * 1.8 + 0.1,)
return level_to_args
def shear_level_to_args(MAX_LEVEL, replace_value):
def level_to_args(level):
level = (level / MAX_LEVEL) * 0.3
if np.random.random() > 0.5: level = -level
return (level, replace_value)
return level_to_args
def translate_level_to_args(translate_const, MAX_LEVEL, replace_value):
def level_to_args(level):
level = (level / MAX_LEVEL) * float(translate_const)
if np.random.random() > 0.5: level = -level
return (level, replace_value)
return level_to_args
def cutout_level_to_args(cutout_const, MAX_LEVEL, replace_value):
def level_to_args(level):
level = int((level / MAX_LEVEL) * cutout_const)
return (level, replace_value)
return level_to_args
def solarize_level_to_args(MAX_LEVEL):
def level_to_args(level):
level = int((level / MAX_LEVEL) * 256)
return (level, )
return level_to_args
def none_level_to_args(level):
return ()
def posterize_level_to_args(MAX_LEVEL):
def level_to_args(level):
level = int((level / MAX_LEVEL) * 4)
return (level, )
return level_to_args
def rotate_level_to_args(MAX_LEVEL, replace_value):
def level_to_args(level):
level = (level / MAX_LEVEL) * 30
if np.random.random() < 0.5:
level = -level
return (level, replace_value)
return level_to_args
func_dict = {
'Identity': identity_func,
'AutoContrast': autocontrast_func,
'Equalize': equalize_func,
'Rotate': rotate_func,
'Solarize': solarize_func,
'Color': color_func,
'Contrast': contrast_func,
'Brightness': brightness_func,
'Sharpness': sharpness_func,
'ShearX': shear_x_func,
'TranslateX': translate_x_func,
'TranslateY': translate_y_func,
'Posterize': posterize_func,
'ShearY': shear_y_func,
}
translate_const = 10
MAX_LEVEL = 10
replace_value = (128, 128, 128)
arg_dict = {
'Identity': none_level_to_args,
'AutoContrast': none_level_to_args,
'Equalize': none_level_to_args,
'Rotate': rotate_level_to_args(MAX_LEVEL, replace_value),
'Solarize': solarize_level_to_args(MAX_LEVEL),
'Color': enhance_level_to_args(MAX_LEVEL),
'Contrast': enhance_level_to_args(MAX_LEVEL),
'Brightness': enhance_level_to_args(MAX_LEVEL),
'Sharpness': enhance_level_to_args(MAX_LEVEL),
'ShearX': shear_level_to_args(MAX_LEVEL, replace_value),
'TranslateX': translate_level_to_args(
translate_const, MAX_LEVEL, replace_value
),
'TranslateY': translate_level_to_args(
translate_const, MAX_LEVEL, replace_value
),
'Posterize': posterize_level_to_args(MAX_LEVEL),
'ShearY': shear_level_to_args(MAX_LEVEL, replace_value),
}
class RandomAugment(object):
def __init__(self, N=2, M=10, isPIL=False, augs=None):
self.N = N
self.M = M
self.isPIL = isPIL
if augs:
self.augs = augs
else:
self.augs = list(arg_dict.keys())
def get_random_ops(self):
sampled_ops = np.random.choice(self.augs, self.N)
return [(op, 0.5, self.M) for op in sampled_ops]
def __call__(self, img):
if self.isPIL:
img = np.array(img)
ops = self.get_random_ops()
for name, prob, level in ops:
if np.random.random() > prob:
continue
args = arg_dict[name](level)
img = func_dict[name](img, *args)
return img
if __name__ == '__main__':
a = RandomAugment()
img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
a(img)
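Several of the augmentations above (`solarize_func`, `posterize_func`, `brightness_func`) reduce to a per-pixel lookup over the 256 possible 8-bit values. The sketch below reproduces two of those lookups in plain Python, without NumPy or OpenCV. `posterize_mask` and `solarize_table` are hypothetical helper names used only here.

```python
def posterize_mask(bits):
    """Bit mask that keeps the top `bits` bits of an 8-bit pixel value."""
    return (255 << (8 - bits)) & 0xFF


def solarize_table(thresh=128):
    """Lookup table that inverts every value at or above `thresh`."""
    return [v if v < thresh else 255 - v for v in range(256)]


print(posterize_mask(4))        # 240 (0b11110000)
print(200 & posterize_mask(4))  # 192
table = solarize_table()
print(table[100], table[200])   # 100 55
```

Applying the table or mask to a whole image is then a single indexing or bitwise operation, which is why these ops are cheap compared with the affine warps.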
timm==0.4.12
Pillow
blobfile
mypy
numpy
pytest
requests
einops
tensorboardX
scipy
ftfy
opencv-python
sentencepiece
pyarrow
torchmetrics==0.7.3
transformers
pycocotools
pycocoevalcap
torchscale==0.2.0
#!/bin/bash
export HIP_VISIBLE_DEVICES=0,1,2,3 # adjust to the IDs and number of devices used for training
export HSA_FORCE_FINE_GRAIN_PCIE=1
export USE_MIOPEN_BATCHNORM=1
python -m torch.distributed.launch --nproc_per_node=4 run_beit3_finetuning.py \
--model beit3_base_patch16_480 \
--input_size 480 \
--task coco_captioning \
--batch_size 32 \
--layer_decay 1.0 \
--lr 4e-5 \
--randaug \
--epochs 10 \
--warmup_epochs 1 \
--drop_path 0.1 \
--sentencepiece_model ./pretrained_models/beit3.spm \
--finetune ./pretrained_models/beit3_base_patch16_224.pth \
--data_path /home/data/coco2014 \
--output_dir ./save_models/ \
--log_dir ./logs \
--weight_decay 0.05 \
--seed 42 \
--save_ckpt_freq 5 \
--num_max_bpe_tokens 32 \
--captioning_mask_prob 0.7 \
--drop_worst_after 12000 \
--dist_eval \
--checkpoint_activations \
--enable_deepspeed
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export USE_MIOPEN_BATCHNORM=1
python -m torch.distributed.launch --nproc_per_node=4 run_beit3_finetuning.py \
--model beit3_base_patch16_480 \
--input_size 480 \
--task coco_captioning \
--batch_size 16 \
--sentencepiece_model ./pretrained_models/beit3.spm \
--finetune ./pretrained_models/beit3_base_patch16_480_coco_captioning.pth \
--data_path ../../data/coco2014/ \
--output_dir ./save_models \
--eval \
--dist_eval