Commit a245fbd1 authored Dec 22, 2023 by chenych
18 changed files
--- a/beit3/README.md
+++ b/beit3/README.md
--- a/beit3/datasets.py
+++ b/beit3/datasets.py
--- a/beit3/engine_for_finetuning.py
+++ b/beit3/engine_for_finetuning.py
--- a/beit3/get_started/get_started_for_captioning.md
+++ b/beit3/get_started/get_started_for_captioning.md
+# Fine-tuning BEiT-3 on Image Captioning
+
+## COCO Captioning Setup
+
+1. [Setup environment](../README.md#setup).
+2. Download the [2014 train images](http://images.cocodataset.org/zips/train2014.zip), the [2014 val images](http://images.cocodataset.org/zips/val2014.zip) and the [Karpathy split](https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip), then organize the dataset in the following structure:
+
+```
+/path/to/your_data/
+  train2014/            
+    COCO_train2014_000000000009.jpg                
+    ...
+  val2014/              
+    COCO_val2014_000000000042.jpg
+    ...       
+  dataset_coco.json
+```
+
+We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
+```python
+from datasets import CaptioningDataset
+from transformers import XLMRobertaTokenizer
+
+tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
+
+CaptioningDataset.make_coco_captioning_dataset_index(
+    data_path="/path/to/your_data",
+    tokenizer=tokenizer,
+)
+```
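Before generating the index files, it can help to confirm that the data directory matches the layout shown above. The following is a minimal sketch that uses only the standard library; the helper name `missing_coco_entries` is hypothetical and is not part of the BEiT-3 code.

```python
from pathlib import Path

# Entries the COCO captioning setup above expects directly under data_path.
EXPECTED_ENTRIES = ("train2014", "val2014", "dataset_coco.json")


def missing_coco_entries(data_path):
    """Return the expected entries that are absent from data_path.

    Hypothetical helper for illustration; not part of the BEiT-3 repository.
    """
    root = Path(data_path)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

An empty list from `missing_coco_entries("/path/to/your_data")` suggests the top-level layout is in place; it does not check the contents of the image folders.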
+
+
+## NoCaps Setup
+
+1. [Setup environment](../README.md#setup).
+2. Download the [NoCaps val set](https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json) and the [NoCaps test set](https://s3.amazonaws.com/nocaps/nocaps_test_image_info.json), download the images using the URLs in the val and test JSON files, then organize the dataset in the following structure:
+
+```
+/path/to/your_data/
+  val/            
+    09c863d76bcf6b00.jpg                
+    ...
+  test/              
+    19dc6913830a0a21.jpg
+    ...       
+  nocaps_val_4500_captions.json
+  nocaps_test_image_info.json
+```
+
+We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
+```python
+from datasets import CaptioningDataset
+from transformers import XLMRobertaTokenizer
+
+tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
+
+CaptioningDataset.make_nocaps_captioning_dataset_index(
+    data_path="/path/to/your_data",
+)
+```
+We use the COCO captioning training set as the training data for NoCaps.
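The setup step that downloads the NoCaps images from the URLs in the JSON files is not spelled out in the guide. The sketch below shows one possible way to do it with the standard library. It assumes that each JSON file holds an `images` list whose entries carry `coco_url` and `file_name` fields; those key names and the helpers `plan_downloads` and `download_all` are assumptions for illustration and should be checked against the files you downloaded.

```python
import json
import urllib.request
from pathlib import Path


def plan_downloads(info, out_dir, url_key="coco_url", name_key="file_name"):
    """Build (url, destination) pairs from a parsed annotation dict.

    Assumes info["images"] is a list of dicts containing url_key and name_key;
    both key names are assumptions about the NoCaps JSON layout.
    """
    out = Path(out_dir)
    return [(img[url_key], out / img[name_key]) for img in info["images"]]


def download_all(json_path, out_dir):
    """Download every image listed in json_path into out_dir, skipping existing files."""
    with open(json_path, encoding="utf-8") as f:
        info = json.load(f)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for url, dest in plan_downloads(info, out_dir):
        if not dest.exists():
            urllib.request.urlretrieve(url, str(dest))
```

Under these assumptions, calling `download_all` once with `nocaps_val_4500_captions.json` and the `val/` folder, and once with `nocaps_test_image_info.json` and the `test/` folder, would produce the layout shown above.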
+
+
+## Example: Fine-tuning BEiT-3 on Captioning
+
+The BEiT-3 **base** model can be fine-tuned on captioning tasks using 8 V100-32GB:
+
+```bash       
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_base_patch16_480 \
+        --input_size 480 \
+        --task coco_captioning \
+        --batch_size 32 \
+        --layer_decay 1.0 \
+        --lr 4e-5 \
+        --randaug \
+        --epochs 10 \
+        --warmup_epochs 1 \
+        --drop_path 0.1 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_base_patch16_224.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_model \
+        --log_dir /path/to/save/your_model/log \
+        --weight_decay 0.05 \
+        --seed 42 \
+        --save_ckpt_freq 5 \
+        --num_max_bpe_tokens 32 \
+        --captioning_mask_prob 0.7 \
+        --drop_worst_after 12000 \
+        --dist_eval \
+        --checkpoint_activations \
+        --enable_deepspeed
+```
+- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*32 = 256`.
+- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models).
+- `--task`: **coco_captioning** for COCO captioning and **nocaps** for NoCaps dataset.
+- `--lr`: 4e-5 for COCO captioning and 1e-5 for NoCaps.
+- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
+- `--checkpoint_activations`: use gradient checkpointing to save GPU memory.
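The effective batch size formula in the notes above can be written as a small function. This is an illustrative sketch only: the function name is hypothetical, and it assumes `--update_freq` is 1 when the flag is omitted, as the `8*32 = 256` figure implies.

```python
def effective_batch_size(num_gpus, batch_size, update_freq=1):
    """Effective batch size = number of GPUs * per-GPU batch size * update_freq."""
    return num_gpus * batch_size * update_freq


# Base captioning example: 8 GPUs, --batch_size 32, no gradient accumulation.
print(effective_batch_size(8, 32))  # 256

# Half the GPUs with two accumulation steps gives the same effective batch size.
print(effective_batch_size(4, 32, update_freq=2))  # 256
```

This is useful when fewer GPUs are available: raising `--update_freq` keeps the effective batch size, and therefore the suggested learning rate, unchanged.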
+
+
+The BEiT-3 **large** model can be fine-tuned on captioning tasks using 8 V100-32GB:
+
+```bash
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_large_patch16_480 \
+        --input_size 480 \
+        --task coco_captioning \
+        --batch_size 32 \
+        --layer_decay 1.0 \
+        --lr 8e-6 \
+        --randaug \
+        --epochs 10 \
+        --warmup_epochs 1 \
+        --drop_path 0.1 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_large_patch16_224.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_model \
+        --log_dir /path/to/save/your_model/log \
+        --weight_decay 0.05 \
+        --seed 42 \
+        --save_ckpt_freq 5 \
+        --num_max_bpe_tokens 32 \
+        --captioning_mask_prob 0.7 \
+        --drop_worst_after 12000 \
+        --dist_eval \
+        --checkpoint_activations \
+        --enable_deepspeed
+```
+- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*32 = 256`.
+- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models).
+- `--task`: **coco_captioning** for COCO captioning and **nocaps** for NoCaps dataset.
+- `--lr`: 8e-6 for both COCO captioning and NoCaps.
+- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
+- `--checkpoint_activations`: use gradient checkpointing to save GPU memory.
+
+
+## Example: Evaluate BEiT-3 Fine-tuned model on Captioning
+
+- Get the prediction file of the fine-tuned BEiT3-base model on captioning with 8 V100-32GB:
+```bash       
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_base_patch16_480 \
+        --input_size 480 \
+        --task coco_captioning \
+        --batch_size 16 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_base_patch16_480_coco_captioning.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_prediction \
+        --eval \
+        --dist_eval
+```
+- `--task`: **coco_captioning** for COCO captioning and **nocaps** for NoCaps dataset.
+- `--finetune`: **beit3_base_patch16_480_coco_captioning.pth** for COCO captioning and **beit3_base_patch16_480_nocaps.pth** for NoCaps dataset.
+
+- Get the prediction file of the fine-tuned BEiT3-large model on captioning with 8 V100-32GB:
+```bash       
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_large_patch16_480 \
+        --input_size 480 \
+        --task coco_captioning \
+        --batch_size 16 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_large_patch16_480_coco_captioning.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_prediction \
+        --eval \
+        --dist_eval
+```
+- `--task`: **coco_captioning** for COCO captioning and **nocaps** for NoCaps dataset.
+- `--finetune`: **beit3_large_patch16_480_coco_captioning.pth** for COCO captioning and **beit3_large_patch16_480_nocaps.pth** for NoCaps dataset.
+
+Please then submit the prediction file in the `output_dir` to the [evaluation server](https://eval.ai/web/challenges/challenge-page/355/overview) to obtain the NoCaps val and test results.
--- a/beit3/get_started/get_started_for_image_classification.md
+++ b/beit3/get_started/get_started_for_image_classification.md
+# Fine-tuning BEiT-3 on ImageNet-1k (Image Classification)
+
+
+## Setup
+
+1. [Setup environment](../README.md#setup).
+2. Download and extract ImageNet-1k from http://image-net.org/.
+
+The directory structure is the standard layout of torchvision's [`datasets.ImageFolder`](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder). The training and validation data are expected to be in the `train/` folder and `val/` folder, respectively:
+
+```
+/path/to/imagenet/
+  train/
+    class1/
+      img1.jpeg
+    class2/
+      img2.jpeg
+  val/
+    class1/
+      img3.jpeg
+    class2/
+      img4.jpeg
+```
+
+We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
+```python
+from datasets import ImageNetDataset
+
+ImageNetDataset.make_dataset_index(
+    train_data_path = "/path/to/your_data/train",
+    val_data_path = "/path/to/your_data/val",
+    index_path = "/path/to/your_data"
+)
+```
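Because the `ImageFolder` layout derives class labels from folder names, a mismatch between the class folders in `train/` and `val/` can silently shift labels. The following is a minimal sketch of a consistency check using only the standard library; the helper names are hypothetical and not part of the BEiT-3 code.

```python
from pathlib import Path


def class_names(split_dir):
    """Return the sorted names of the class subfolders in split_dir."""
    return sorted(p.name for p in Path(split_dir).iterdir() if p.is_dir())


def classes_match(train_dir, val_dir):
    """Return True when train and val contain exactly the same class folders."""
    return class_names(train_dir) == class_names(val_dir)
```

Running `classes_match("/path/to/your_data/train", "/path/to/your_data/val")` before building the index is a cheap way to catch a misnamed or missing class folder.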
+
+
+## Example: Fine-tuning BEiT-3 on ImageNet-1k (Image Classification)
+
+The BEiT-3 **base** model can be finetuned on ImageNet-1k using 8 V100-32GB:
+
+```bash       
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_base_patch16_224 \
+        --task imagenet \
+        --batch_size 128 \
+        --layer_decay 0.65 \
+        --lr 7e-4 \
+        --update_freq 1 \
+        --epochs 50 \
+        --warmup_epochs 5 \
+        --drop_path 0.15 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_base_patch16_224.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_model \
+        --log_dir /path/to/save/your_model/log \
+        --weight_decay 0.05 \
+        --seed 42 \
+        --save_ckpt_freq 5 \
+        --dist_eval \
+        --mixup 0.8 \
+        --cutmix 1.0 \
+        --enable_deepspeed
+```
+- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*128*1 = 1024`.
+- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models)
+- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
+
+
+The BEiT-3 **large** model can be finetuned on ImageNet-1k using a DGX box (8 V100-32GB):
+
+```bash
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_large_patch16_224 \
+        --task imagenet \
+        --batch_size 128 \
+        --layer_decay 0.8 \
+        --lr 2e-4 \
+        --update_freq 1 \
+        --epochs 50 \
+        --warmup_epochs 5 \
+        --drop_path 0.25 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_large_patch16_224.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_model \
+        --log_dir /path/to/save/your_model/log \
+        --weight_decay 0.05 \
+        --seed 42 \
+        --save_ckpt_freq 5 \
+        --dist_eval \
+        --mixup 0.8 \
+        --cutmix 1.0 \
+        --enable_deepspeed \
+        --checkpoint_activations
+```
+- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*128 = 1024`.
+- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models)
+- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
+- `--checkpoint_activations`: use gradient checkpointing to save GPU memory.
+
+## Example: Evaluate BEiT-3 Finetuned model on ImageNet-1k (Image Classification)
+
+- Evaluate our fine-tuned BEiT3-base model on ImageNet val with a single GPU:
+```bash       
+python -m torch.distributed.launch --nproc_per_node=1 run_beit3_finetuning.py \
+        --model beit3_base_patch16_224 \
+        --task imagenet \
+        --batch_size 128 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_base_patch16_224_in1k.pth \
+        --data_path /path/to/your_data \
+        --eval \
+        --dist_eval
+```
+
+Expected results:
+```
+* Acc@1 85.400 Acc@5 97.630
+```
+
+- Evaluate our fine-tuned BEiT3-large model on ImageNet val with a single GPU:
+```bash       
+python -m torch.distributed.launch --nproc_per_node=1 run_beit3_finetuning.py \
+        --model beit3_large_patch16_224 \
+        --task imagenet \
+        --batch_size 128 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_large_patch16_224_in1k.pth \
+        --data_path /path/to/your_data \
+        --eval \
+        --dist_eval
+```
+
+Expected results:
+```
+* Acc@1 87.580 Acc@5 98.326
+```
--- a/beit3/get_started/get_started_for_nlvr2.md
+++ b/beit3/get_started/get_started_for_nlvr2.md
+# Fine-tuning BEiT-3 on NLVR2 (Visual Reasoning)
+
+
+## Setup
+
+1. [Setup environment](../README.md#setup).
+2. Clone the [repository](https://github.com/lil-lab/nlvr) and sign the [request form](https://goo.gl/forms/yS29stWnFWzrDBFH3) to download the images, then organize the dataset in the following structure:
+
+```
+/path/to/your_data/
+  images/train/                            
+    0/train-11670-0-img0.png
+    ...
+  dev/              
+    dev-269-0-img0.png
+    ...  
+  test1/              
+    test1-261-0-img0.png
+    ...         
+  nlvr/ (nlvr repo)
+    nlvr/
+    nlvr2/
+```
+
+We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
+```python
+from datasets import NLVR2Dataset
+from transformers import XLMRobertaTokenizer
+
+tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
+
+NLVR2Dataset.make_dataset_index(
+    data_path="/path/to/your_data", 
+    tokenizer=tokenizer, 
+    nlvr_repo_path="/path/to/your_data/nlvr"
+)
+```
+
+
+## Example: Fine-tuning BEiT-3 on NLVR2 (Visual Reasoning)
+
+The BEiT-3 **base** model can be finetuned on NLVR2 using 8 V100-32GB:
+
+```bash       
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_base_patch16_224 \
+        --task nlvr2 \
+        --batch_size 32 \
+        --layer_decay 0.65 \
+        --lr 7e-4 \
+        --epochs 20 \
+        --warmup_epochs 5 \
+        --drop_path 0.2 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_base_patch16_224.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_model \
+        --log_dir /path/to/save/your_model/log \
+        --weight_decay 0.2 \
+        --seed 42 \
+        --save_ckpt_freq 5 \
+        --enable_deepspeed
+```
+- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*32 = 256`.
+- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models).
+- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
+- `--lr`: 7e-4 for `BEiT3-base`, 5e-4 for `BEiT3-base-indomain`.
+
+
+The BEiT-3 **large** model can be finetuned on NLVR2 using 8 V100-32GB:
+
+```bash
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_large_patch16_224 \
+        --task nlvr2 \
+        --batch_size 32 \
+        --layer_decay 0.85 \
+        --lr 3e-4 \
+        --epochs 20 \
+        --warmup_epochs 5 \
+        --drop_path 0.2 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_large_patch16_224.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_model \
+        --log_dir /path/to/save/your_model/log \
+        --weight_decay 0.2 \
+        --seed 42 \
+        --save_ckpt_freq 5 \
+        --enable_deepspeed \
+        --checkpoint_activations
+```
+- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*32 = 256`.
+- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models).
+- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
+- `--lr`: 3e-4 for `BEiT3-large`, 1e-4 for `BEiT3-large-indomain`.
+- `--checkpoint_activations`: use gradient checkpointing to save GPU memory.
+
+
+## Example: Evaluate BEiT-3 Finetuned model on NLVR2 (Visual Reasoning)
+
+- Get the result of our fine-tuned BEiT3-base model on NLVR2 test with 8 V100-32GB:
+```bash       
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_base_patch16_224 \
+        --task nlvr2 \
+        --batch_size 32 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_base_patch16_224_nlvr2.pth \
+        --data_path /path/to/your_data \
+        --eval \
+        --dist_eval
+```
+
+Expected results:
+```
+* Acc 84.386
+```
+
+- Get the result of our fine-tuned BEiT3-large model on NLVR2 test with 8 V100-32GB:
+```bash       
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_large_patch16_224 \
+        --task nlvr2 \
+        --batch_size 32 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_large_patch16_224_nlvr2.pth \
+        --data_path /path/to/your_data \
+        --eval \
+        --dist_eval
+```
+
+Expected results:
+```
+* Acc 89.437
+```
--- a/beit3/get_started/get_started_for_retrieval.md
+++ b/beit3/get_started/get_started_for_retrieval.md
+# Fine-tuning BEiT-3 on Image-text Retrieval
+
+## COCO Retrieval Setup
+
+1. [Setup environment](../README.md#setup).
+2. Download the [2014 train images](http://images.cocodataset.org/zips/train2014.zip), the [2014 val images](http://images.cocodataset.org/zips/val2014.zip) and the [Karpathy split](https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip), then organize the dataset in the following structure:
+
+```
+/path/to/your_data/
+  train2014/            
+    COCO_train2014_000000000009.jpg                
+    ...
+  val2014/              
+    COCO_val2014_000000000042.jpg
+    ...       
+  dataset_coco.json
+```
+
+We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
+```python
+from datasets import RetrievalDataset
+from transformers import XLMRobertaTokenizer
+
+tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
+
+RetrievalDataset.make_coco_dataset_index(
+    data_path="/path/to/your_data",
+    tokenizer=tokenizer,
+)
+```
+
+
+## Flickr30k Retrieval Setup
+
+1. [Setup environment](../README.md#setup).
+2. Sign the [Flickr images request form](https://forms.illinois.edu/sec/229675) and download the [Karpathy split](https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip), then organize the dataset in the following structure:
+
+```
+/path/to/your_data/
+  flickr30k-images/            
+    2923475135.jpg                
+    ...      
+  dataset_flickr30k.json
+```
+
+We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
+```python
+from datasets import RetrievalDataset
+from transformers import XLMRobertaTokenizer
+
+tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
+
+RetrievalDataset.make_flickr30k_dataset_index(
+    data_path="/path/to/your_data",
+    tokenizer=tokenizer,
+    karpathy_path="/path/to/your_data",
+)
+```
+
+
+## Example: Fine-tuning BEiT-3 on Retrieval
+
+The BEiT-3 **base** model can be finetuned on retrieval tasks using 16 V100-32GB:
+
+```bash       
+python -m torch.distributed.launch --nproc_per_node=16 run_beit3_finetuning.py \
+        --model beit3_base_patch16_384 \
+        --input_size 384 \
+        --task coco_retrieval \
+        --batch_size 192 \
+        --layer_decay 0.65 \
+        --lr 2e-4 \
+        --epochs 15 \
+        --warmup_epochs 3 \
+        --drop_path 0.2 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_base_itc_patch16_224.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_model \
+        --log_dir /path/to/save/your_model/log \
+        --weight_decay 0.05 \
+        --seed 42 \
+        --save_ckpt_freq 5 \
+        --enable_deepspeed \
+        --checkpoint_activations
+```
+- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `192*16 = 3072`.
+- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models)
+- `--task`: **coco_retrieval** for COCO retrieval, **flickr30k** for Flickr30k retrieval
+- `--lr`: 2e-4 for COCO retrieval, 1e-4 for Flickr30k retrieval
+- `--epochs`: 15 for COCO retrieval, 20 for Flickr30k retrieval
+- `--warmup_epochs`: 3 for COCO retrieval, 5 for Flickr30k retrieval
+- `--checkpoint_activations`: use gradient checkpointing to save GPU memory.
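The notes above list several flags whose values differ between COCO and Flickr30k retrieval for the base model. One way to keep them together is a small lookup table, sketched below; the names `RETRIEVAL_OVERRIDES` and `task_args` are hypothetical and only the values come from the notes.

```python
# Per-task values for the base model, taken from the option notes above.
RETRIEVAL_OVERRIDES = {
    "coco_retrieval": {"lr": "2e-4", "epochs": "15", "warmup_epochs": "3"},
    "flickr30k": {"lr": "1e-4", "epochs": "20", "warmup_epochs": "5"},
}


def task_args(task):
    """Return the command-line arguments that differ between the two retrieval tasks."""
    args = ["--task", task]
    for flag, value in RETRIEVAL_OVERRIDES[task].items():
        args += ["--" + flag, value]
    return args


print(" ".join(task_args("flickr30k")))
# --task flickr30k --lr 1e-4 --epochs 20 --warmup_epochs 5
```

The remaining flags in the command are the same for both tasks, so only these four arguments need to be swapped when switching datasets.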
+
+
+The BEiT-3 **large** model can be finetuned on retrieval tasks using 2x16 V100-32GB:
+
+```bash
+python -m torch.distributed.launch --nproc_per_node=16 --nnodes=2 --node_rank=$NODE_RANK \
+       --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT run_beit3_finetuning.py \
+        --model beit3_large_patch16_384 \
+        --input_size 384 \
+        --task coco_retrieval \
+        --batch_size 96 \
+        --layer_decay 0.85 \
+        --lr 5e-5 \
+        --epochs 15 \
+        --warmup_epochs 3 \
+        --drop_path 0.2 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_large_itc_patch16_224.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_model \
+        --log_dir /path/to/save/your_model/log \
+        --weight_decay 0.05 \
+        --seed 42 \
+        --save_ckpt_freq 5 \
+        --enable_deepspeed \
+        --checkpoint_activations
+```
+- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `96*32 = 3072`.
+- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models)
+- `--task`: **coco_retrieval** for COCO retrieval, **flickr30k** for Flickr30k retrieval
+- `--epochs`: 15 for COCO retrieval, 20 for Flickr30k retrieval
+- `--warmup_epochs`: 3 for COCO retrieval, 5 for Flickr30k retrieval
+- `--checkpoint_activations`: use gradient checkpointing to save GPU memory.
+
+
+## Example: Evaluate BEiT-3 Fine-tuned model on COCO Retrieval and Flickr30k Retrieval
+
+- Get the results of our fine-tuned BEiT3-base model on retrieval tasks using a single GPU:
+```bash       
+python -m torch.distributed.launch --nproc_per_node=1 run_beit3_finetuning.py \
+        --model beit3_base_patch16_384 \
+        --input_size 384 \
+        --task coco_retrieval \
+        --batch_size 16 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_base_patch16_384_coco_retrieval.pth \
+        --data_path /path/to/your_data \
+        --eval \
+        --dist_eval
+```
+- `--task`: **coco_retrieval** for COCO retrieval, **flickr30k** for Flickr30k retrieval
+- `--finetune`: **beit3_base_patch16_384_coco_retrieval.pth** for COCO retrieval, **beit3_base_patch16_384_f30k_retrieval.pth** for Flickr30k retrieval
+
+- Get the results of our fine-tuned BEiT3-large model on retrieval tasks using a single GPU:
+```bash       
+python -m torch.distributed.launch --nproc_per_node=1 run_beit3_finetuning.py \
+        --model beit3_large_patch16_384 \
+        --input_size 384 \
+        --task coco_retrieval \
+        --batch_size 16 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_large_patch16_384_coco_retrieval.pth \
+        --data_path /path/to/your_data \
+        --eval \
+        --dist_eval
+```
+- `--task`: **coco_retrieval** for COCO retrieval, **flickr30k** for Flickr30k retrieval
+- `--finetune`: **beit3_large_patch16_384_coco_retrieval.pth** for COCO retrieval, **beit3_large_patch16_384_f30k_retrieval.pth** for Flickr30k retrieval
--- a/beit3/get_started/get_started_for_vqav2.md
+++ b/beit3/get_started/get_started_for_vqav2.md
+# Fine-tuning BEiT-3 on VQAv2 (Visual Question Answering)
+
+
+## Setup
+
+1. [Setup environment](../README.md#setup).
+2. Download the COCO [2014 train images](http://images.cocodataset.org/zips/train2014.zip), [2014 val images](http://images.cocodataset.org/zips/val2014.zip), [2015 test images](http://images.cocodataset.org/zips/test2015.zip), annotations ([train](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip), [val](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip)), and questions ([train](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip), [val](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip), [test](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Test_mscoco.zip)), then organize the dataset in the following structure:
+
+```
+/path/to/your_data/
+  train2014/            
+    COCO_train2014_000000000009.jpg                
+    ...
+  val2014/              
+    COCO_val2014_000000000042.jpg
+    ...  
+  test2015/              
+    COCO_test2015_000000000001.jpg
+    ...         
+  vqa/
+    v2_OpenEnded_mscoco_train2014_questions.json
+    v2_OpenEnded_mscoco_val2014_questions.json
+    v2_OpenEnded_mscoco_test2015_questions.json
+    v2_OpenEnded_mscoco_test-dev2015_questions.json
+    v2_mscoco_train2014_annotations.json
+    v2_mscoco_val2014_annotations.json
+```
+
+We then generate the index json files using the following command. [beit3.spm](https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D) is the sentencepiece model used for tokenizing texts.
+```python
+from datasets import VQAv2Dataset
+from transformers import XLMRobertaTokenizer
+
+tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
+
+VQAv2Dataset.make_dataset_index(
+    data_path="/path/to/your_data",
+    tokenizer=tokenizer,
+    annotation_data_path="/path/to/your_data/vqa",
+)
+```
+
+
+## Example: Fine-tuning BEiT-3 on VQAv2 (Visual Question Answering)
+
+The BEiT-3 **base** model can be finetuned on VQAv2 using 8 V100-32GB:
+
+```bash       
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_base_patch16_480 \
+        --input_size 480 \
+        --task vqav2 \
+        --batch_size 16 \
+        --layer_decay 1.0 \
+        --lr 3e-5 \
+        --update_freq 1 \
+        --randaug \
+        --epochs 10 \
+        --warmup_epochs 1 \
+        --drop_path 0.1 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_base_patch16_224.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_model \
+        --log_dir /path/to/save/your_model/log \
+        --weight_decay 0.01 \
+        --seed 42 \
+        --save_ckpt_freq 5 \
+        --task_head_lr_weight 20 \
+        --opt_betas 0.9 0.98 \
+        --enable_deepspeed
+```
+- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*16 = 128`.
+- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models)
+- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
+
+
+The BEiT-3 **large** model can be finetuned on VQAv2 using 8 V100-32GB:
+
+```bash
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_large_patch16_480 \
+        --input_size 480 \
+        --task vqav2 \
+        --batch_size 16 \
+        --layer_decay 1.0 \
+        --lr 2e-5 \
+        --update_freq 1 \
+        --randaug \
+        --epochs 10 \
+        --warmup_epochs 1 \
+        --drop_path 0.15 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_large_patch16_224.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_model \
+        --log_dir /path/to/save/your_model/log \
+        --weight_decay 0.01 \
+        --seed 42 \
+        --save_ckpt_freq 5 \
+        --task_head_lr_weight 20 \
+        --opt_betas 0.9 0.98 \
+        --enable_deepspeed \
+        --checkpoint_activations
+```
+- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*16 = 128`.
+- `--finetune`: weight path of your pretrained models; please download the pretrained model weights in [README.md](../README.md#pretrained-models)
+- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
+- `--checkpoint_activations`: use gradient checkpointing to save GPU memory.
+
+
+## Example: Evaluate BEiT-3 Finetuned model on VQAv2 (Visual Question Answering)
+
+- Get the prediction file of the fine-tuned BEiT3-base model on VQAv2 test with 8 V100-32GB:
+```bash       
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_base_patch16_480 \
+        --input_size 480 \
+        --task vqav2 \
+        --batch_size 16 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_base_patch16_480_vqa.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_prediction \
+        --eval \
+        --dist_eval
+```
+
+- Get the prediction file of the fine-tuned BEiT3-large model on VQAv2 test with 8 V100-32GB:
+```bash
+python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
+        --model beit3_large_patch16_480 \
+        --input_size 480 \
+        --task vqav2 \
+        --batch_size 16 \
+        --sentencepiece_model /your_beit3_model_path/beit3.spm \
+        --finetune /your_beit3_model_path/beit3_large_patch16_480_vqa.pth \
+        --data_path /path/to/your_data \
+        --output_dir /path/to/save/your_prediction \
+        --eval \
+        --dist_eval
+```
+
+Please then submit the prediction file in the `output_dir` to the [evaluation server](https://eval.ai/web/challenges/challenge-page/830/overview) to obtain the VQAv2 test-dev and test-std results.
--- a/beit3/glossary.py
+++ b/beit3/glossary.py
+import re
+
+contractions = {
+    "aint": "ain't",
+    "arent": "aren't",
+    "cant": "can't",
+    "couldve": "could've",
+    "couldnt": "couldn't",
+    "couldn'tve": "couldn't've",
+    "couldnt've": "couldn't've",
+    "didnt": "didn't",
+    "doesnt": "doesn't",
+    "dont": "don't",
+    "hadnt": "hadn't",
+    "hadnt've": "hadn't've",
+    "hadn'tve": "hadn't've",
+    "hasnt": "hasn't",
+    "havent": "haven't",
+    "hed": "he'd",
+    "hed've": "he'd've",
+    "he'dve": "he'd've",
+    "hes": "he's",
+    "howd": "how'd",
+    "howll": "how'll",
+    "hows": "how's",
+    "Id've": "I'd've",
+    "I'dve": "I'd've",
+    "Im": "I'm",
+    "Ive": "I've",
+    "isnt": "isn't",
+    "itd": "it'd",
+    "itd've": "it'd've",
+    "it'dve": "it'd've",
+    "itll": "it'll",
+    "let's": "let's",
+    "maam": "ma'am",
+    "mightnt": "mightn't",
+    "mightnt've": "mightn't've",
+    "mightn'tve": "mightn't've",
+    "mightve": "might've",
+    "mustnt": "mustn't",
+    "mustve": "must've",
+    "neednt": "needn't",
+    "notve": "not've",
+    "oclock": "o'clock",
+    "oughtnt": "oughtn't",
+    "ow's'at": "'ow's'at",
+    "'ows'at": "'ow's'at",
+    "'ow'sat": "'ow's'at",
+    "shant": "shan't",
+    "shed've": "she'd've",
+    "she'dve": "she'd've",
+    "she's": "she's",
+    "shouldve": "should've",
+    "shouldnt": "shouldn't",
+    "shouldnt've": "shouldn't've",
+    "shouldn'tve": "shouldn't've",
+    "somebodyd": "somebody'd",
+    "somebodyd've": "somebody'd've",
+    "somebody'dve": "somebody'd've",
+    "somebodyll": "somebody'll",
+    "somebodys": "somebody's",
+    "someoned": "someone'd",
+    "someoned've": "someone'd've",
+    "someone'dve": "someone'd've",
+    "someonell": "someone'll",
+    "someones": "someone's",
+    "somethingd": "something'd",
+    "somethingd've": "something'd've",
+    "something'dve": "something'd've",
+    "somethingll": "something'll",
+    "thats": "that's",
+    "thered": "there'd",
+    "thered've": "there'd've",
+    "there'dve": "there'd've",
+    "therere": "there're",
+    "theres": "there's",
+    "theyd": "they'd",
+    "theyd've": "they'd've",
+    "they'dve": "they'd've",
+    "theyll": "they'll",
+    "theyre": "they're",
+    "theyve": "they've",
+    "twas": "'twas",
+    "wasnt": "wasn't",
+    "wed've": "we'd've",
+    "we'dve": "we'd've",
+    "weve": "we've",
+    "werent": "weren't",
+    "whatll": "what'll",
+    "whatre": "what're",
+    "whats": "what's",
+    "whatve": "what've",
+    "whens": "when's",
+    "whered": "where'd",
+    "wheres": "where's",
+    "whereve": "where've",
+    "whod": "who'd",
+    "whod've": "who'd've",
+    "who'dve": "who'd've",
+    "wholl": "who'll",
+    "whos": "who's",
+    "whove": "who've",
+    "whyll": "why'll",
+    "whyre": "why're",
+    "whys": "why's",
+    "wont": "won't",
+    "wouldve": "would've",
+    "wouldnt": "wouldn't",
+    "wouldnt've": "wouldn't've",
+    "wouldn'tve": "wouldn't've",
+    "yall": "y'all",
+    "yall'll": "y'all'll",
+    "y'allll": "y'all'll",
+    "yall'd've": "y'all'd've",
+    "y'alld've": "y'all'd've",
+    "y'all'dve": "y'all'd've",
+    "youd": "you'd",
+    "youd've": "you'd've",
+    "you'dve": "you'd've",
+    "youll": "you'll",
+    "youre": "you're",
+    "youve": "you've",
+}
+
+manual_map = {
+    "none": "0",
+    "zero": "0",
+    "one": "1",
+    "two": "2",
+    "three": "3",
+    "four": "4",
+    "five": "5",
+    "six": "6",
+    "seven": "7",
+    "eight": "8",
+    "nine": "9",
+    "ten": "10",
+}
+articles = ["a", "an", "the"]
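+# NOTE: "(?!<=\d)" is a negative lookahead, not the lookbehind "(?<!\d)" it resembles;
+# it is left unchanged here so the normalization behaviour stays the same.
+period_strip = re.compile(r"(?!<=\d)(\.)(?!\d)")
+comma_strip = re.compile(r"(\d)(\,)(\d)")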
+punct = [
+    ";",
+    r"/",
+    "[",
+    "]",
+    '"',
+    "{",
+    "}",
+    "(",
+    ")",
+    "=",
+    "+",
+    "\\",
+    "_",
+    "-",
+    ">",
+    "<",
+    "@",
+    "`",
+    ",",
+    "?",
+    "!",
+]
+
+
+def normalize_word(token):
+    _token = token
+    for p in punct:
+        if (p + " " in token or " " + p in token) or (
+            re.search(comma_strip, token) is not None
+        ):
+            _token = _token.replace(p, "")
+        else:
+            _token = _token.replace(p, " ")
+    # Pattern.sub's third positional argument is `count`, not flags, so re.UNICODE is dropped.
+    token = period_strip.sub("", _token)
+
+    _token = []
+    temp = token.lower().split()
+    for word in temp:
+        # Use .get so lookups do not insert new keys into manual_map.
+        word = manual_map.get(word, word)
+        if word not in articles:
+            _token.append(word)
+    for i, word in enumerate(_token):
+        if word in contractions:
+            _token[i] = contractions[word]
+    token = " ".join(_token)
+    token = token.replace(",", "")
+    return token
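
To make the normalization steps in `normalize_word` easier to follow, here is a reduced, self-contained sketch. It is an illustration under stated assumptions, not the repository's function: the tables are cut down to a few entries, period stripping is omitted, and the name `normalize_answer` is hypothetical.

```python
import re

# Reduced tables for illustration only; the full ones are defined in glossary.py.
contractions = {"dont": "don't", "isnt": "isn't"}
manual_map = {"none": "0", "two": "2", "three": "3"}
articles = {"a", "an", "the"}
punct = [";", "/", '"', "(", ")", "-", ",", "?", "!"]
comma_strip = re.compile(r"(\d)(,)(\d)")


def normalize_answer(answer: str) -> str:
    out = answer
    for p in punct:
        # Delete punctuation that touches a space (or any punctuation when the
        # answer contains a digit-grouping comma); otherwise replace it by a space.
        if (p + " " in answer or " " + p in answer) or comma_strip.search(answer):
            out = out.replace(p, "")
        else:
            out = out.replace(p, " ")
    words = [manual_map.get(w, w) for w in out.lower().split()]  # number words -> digits
    words = [w for w in words if w not in articles]              # drop a / an / the
    words = [contractions.get(w, w) for w in words]              # restore apostrophes
    return " ".join(words)


print(normalize_answer("The two dogs!"))
print(normalize_answer("dont know"))
print(normalize_answer("1,000"))
```

The order matters: articles are removed after number words are mapped, and contractions are restored last, on the already lower-cased words.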
--- a/beit3/modeling_finetune.py
+++ b/beit3/modeling_finetune.py
+# --------------------------------------------------------
+# Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks (https://arxiv.org/abs/2208.10442)
+# Github source: https://github.com/microsoft/unilm/tree/master/beit3
+# Copyright (c) 2023 Microsoft
+# Licensed under The MIT License [see LICENSE for details]
+# --------------------------------------------------------
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from timm.models.registry import register_model
+import numpy as np
+
+import utils
+from modeling_utils import BEiT3Wrapper, _get_base_config, _get_large_config
+
+
+class TwoLayerMLP(nn.Module):
+    def __init__(
+            self, 
+            in_features, 
+            hidden_features, 
+            out_features, 
+            norm_layer, 
+            norm_input=True, 
+    ):
+        super().__init__()
+        self.norm1 = norm_layer(in_features) if norm_input else nn.Identity()
+        self.dense1 = nn.Linear(in_features, hidden_features)
+        self.norm2 = norm_layer(hidden_features)
+        self.act = nn.GELU()
+        self.dense2 = nn.Linear(hidden_features, out_features)
+
+    def forward(self, x):
+        x = self.norm1(x)
+        x = self.dense1(x)
+        x = self.norm2(x)
+        x = self.act(x)
+        return self.dense2(x)
+
+
+class Pooler(nn.Module):
+    def __init__(self, input_features, output_features, norm_layer):
+        super().__init__()
+        self.norm = norm_layer(input_features)
+        self.dense = nn.Linear(input_features, output_features)
+        self.activation = nn.Tanh()
+
+    def forward(self, x):
+        cls_rep = x[:, 0, :]
+        cls_rep = self.norm(cls_rep)
+        pooled_output = self.dense(cls_rep)
+        pooled_output = self.activation(pooled_output)
+        return pooled_output
+
+
+class BEiT3ForVisualReasoning(BEiT3Wrapper):
+    def __init__(
+            self, 
+            args, 
+            num_classes, 
+            norm_layer=nn.LayerNorm, 
+            **kwargs
+    ):
+        super(BEiT3ForVisualReasoning, self).__init__(args=args)
+        embed_dim = args.encoder_embed_dim
+        self.head = TwoLayerMLP(
+            in_features=embed_dim * 4, 
+            hidden_features=embed_dim * 2,
+            out_features=num_classes, 
+            norm_layer=norm_layer, 
+        )
+        init_scale = 0.001
+        self.head.apply(self._init_weights)
+        if isinstance(self.head.dense1, nn.Linear):
+            self.head.dense1.weight.data.mul_(init_scale)
+            self.head.dense1.bias.data.mul_(init_scale)
+
+        if isinstance(self.head.dense2, nn.Linear):
+            self.head.dense2.weight.data.mul_(init_scale)
+            self.head.dense2.bias.data.mul_(init_scale)
+
+    def forward(self, image_a, image_b, text_description, padding_mask, **kwargs):
+        bsz, _ = text_description.size()
+        
+        vision_input = torch.cat((image_a, image_b), dim=0)
+        language_input = torch.cat((text_description, text_description), dim=0)
+        padding_mask = torch.cat((padding_mask, padding_mask), dim=0)
+
+        outputs = self.beit3(
+            textual_tokens=language_input, 
+            visual_tokens=vision_input, 
+            text_padding_position=padding_mask, 
+        )
+        x = outputs["encoder_out"]
+        multiway_split_position = outputs["multiway_split_position"]
+
+        vision_cls = x[:, 0, :]
+        language_cls = x[:, multiway_split_position, :]
+        cls_rep = torch.cat((vision_cls, language_cls), dim=-1)
+        a, b = torch.split(cls_rep, split_size_or_sections=[bsz, bsz], dim=0)
+        cls_rep = torch.cat((a, b), dim=-1)
+        return self.head(cls_rep)
+    
+
+class BEiT3ForImageClassification(BEiT3Wrapper):
+    def __init__(
+            self, 
+            args, 
+            num_classes, 
+            norm_layer=nn.LayerNorm, 
+            **kwargs
+    ):
+        super(BEiT3ForImageClassification, self).__init__(args=args)
+        embed_dim = args.encoder_embed_dim
+        self.fc_norm = norm_layer(embed_dim)
+        self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
+
+        self.fc_norm.apply(self._init_weights)
+        self.head.apply(self._init_weights)
+        init_scale = 0.001
+        if isinstance(self.head, nn.Linear):
+            self.head.weight.data.mul_(init_scale)
+            self.head.bias.data.mul_(init_scale)
+
+    def forward(self, image, **kwargs):
+        x = self.beit3(textual_tokens=None, visual_tokens=image)["encoder_out"]
+        t = x[:, 1:, :]
+        cls_x = self.fc_norm(t.mean(1))
+        return self.head(cls_x)
+
+
+class BEiT3ForCaptioning(BEiT3Wrapper):
+    def __init__(
+            self, 
+            args, 
+            **kwargs
+    ):
+        super(BEiT3ForCaptioning, self).__init__(args=args)
+        embed_dim = args.encoder_embed_dim
+        self.mlm_head = nn.Linear(embed_dim, args.vocab_size)
+        self.mlm_head.apply(self._init_weights)
+
+    def forward(self, image, text_ids, padding_mask, language_masked_pos, text_len=None, incremental_state=None, **kwargs):
+        text_len = text_len if text_len is not None else text_ids.size(1)
+        image_len = self.beit3.vision_embed.num_position_embeddings()
+        max_len = text_len + image_len
+        uni_mask = torch.zeros((max_len, max_len), dtype=torch.long, device=text_ids.device)
+        i_start, i_end = 0, image_len
+        t_start, t_end = image_len, max_len
+        # triangle mask for caption to caption
+        uni_mask[t_start:t_end, t_start:t_end] = torch.tril(torch.ones(text_len, text_len, dtype=torch.long, device=text_ids.device))
+        # full attention for caption to image
+        uni_mask[t_start:t_end, i_start:i_end] = 1
+        # full attention for image to image
+        uni_mask[i_start:i_end, i_start:i_end] = 1
+        uni_mask = 1-uni_mask
+
+        if incremental_state is not None:
+            for idx in range(self.get_num_layers()):
+                if idx not in incremental_state:
+                    incremental_state[idx] = {}
+        
+        # for incremental decoding
+        positions = None
+        if image is None:
+            uni_mask = uni_mask[-2:]
+            padding_mask = None
+            # start position (2 (fairseq starts at 2) + cur_position) is equal to text_len
+            positions = torch.arange(text_len, text_ids.size(1) + text_len, device=text_ids.device).long().unsqueeze(0)
+
+        outputs = self.beit3(
+            textual_tokens=text_ids, 
+            visual_tokens=image, 
+            text_padding_position=padding_mask,
+            attn_mask=uni_mask,
+            incremental_state=incremental_state,
+            positions=positions,
+        )
+        if image is not None:
+            text_feats = outputs["encoder_out"][:, image_len:]
+        else:
+            text_feats = outputs["encoder_out"]
+
+        if language_masked_pos is not None:
+            text_feats = text_feats[language_masked_pos.bool()]
+
+        return self.mlm_head(text_feats), incremental_state
+
+
+class BEiT3ForVisualQuestionAnswering(BEiT3Wrapper):
+    def __init__(
+            self, 
+            args, 
+            num_classes, 
+            norm_layer=nn.LayerNorm, 
+            **kwargs
+    ):
+        super(BEiT3ForVisualQuestionAnswering, self).__init__(args=args)
+        embed_dim = args.encoder_embed_dim
+        self.pooler = Pooler(
+            input_features=embed_dim, 
+            output_features=embed_dim, 
+            norm_layer=norm_layer, 
+        )
+        self.pooler.apply(self._init_weights)
+        self.head = nn.Sequential(
+            nn.Linear(embed_dim, embed_dim * 2), 
+            norm_layer(embed_dim * 2), 
+            nn.GELU(), 
+            nn.Linear(embed_dim * 2, num_classes), 
+        )
+        self.head.apply(self._init_weights)
+
+    def forward(self, image, question, padding_mask, **kwargs):
+        outputs = self.beit3(
+            textual_tokens=question, 
+            visual_tokens=image, 
+            text_padding_position=padding_mask, 
+        )
+        x = outputs["encoder_out"]
+        cls_rep = self.pooler(x)
+        return self.head(cls_rep)
+
+
+class BEiT3ForRetrieval(BEiT3Wrapper):
+    def __init__(
+            self, 
+            args,
+            **kwargs
+    ):
+        super(BEiT3ForRetrieval, self).__init__(args=args)
+        embed_dim = args.encoder_embed_dim
+        self.language_head = nn.Linear(embed_dim, embed_dim, bias=False)
+        self.vision_head = nn.Linear(embed_dim, embed_dim, bias=False)
+        self.language_head.apply(self._init_weights)
+        self.vision_head.apply(self._init_weights)
+        self.criterion = utils.ClipLoss(
+            rank=utils.get_rank(), 
+            world_size=utils.get_world_size(), 
+        )
+        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
+
+    def forward(self, image=None, text_description=None, padding_mask=None, only_infer=False, **kwargs):
+        if image is not None:
+            outputs = self.beit3(
+                textual_tokens=None, 
+                visual_tokens=image, 
+                text_padding_position=None, 
+            )
+            x = outputs["encoder_out"]
+            vision_cls = self.vision_head(x[:, 0, :])
+            vision_cls = F.normalize(vision_cls, dim=-1)
+        else:
+            vision_cls = None
+
+        if text_description is not None:
+            outputs = self.beit3(
+                textual_tokens=text_description, 
+                visual_tokens=None, 
+                text_padding_position=padding_mask, 
+            )
+            x = outputs["encoder_out"]
+            language_cls = self.language_head(x[:, 0, :])
+            language_cls = F.normalize(language_cls, dim=-1)
+        else:
+            language_cls = None
+        
+        if only_infer:
+            return vision_cls, language_cls
+        else:
+            loss, logits_per_image, logits_per_text = self.criterion(
+                vision_cls, language_cls, self.logit_scale.exp())
+            return loss, vision_cls, language_cls
+
+
+@register_model
+def beit3_base_patch16_224_imageclassification(pretrained=False, **kwargs):
+    args = _get_base_config(**kwargs)
+    args.normalize_output = False
+    model = BEiT3ForImageClassification(args, num_classes=1000, **kwargs)
+    return model
+
+
+@register_model
+def beit3_large_patch16_224_imageclassification(pretrained=False, **kwargs):
+    args = _get_large_config(**kwargs)
+    args.normalize_output = False
+    model = BEiT3ForImageClassification(args, num_classes=1000, **kwargs)
+    return model
+
+
+@register_model
+def beit3_base_patch16_224_nlvr2(pretrained=False, **kwargs):
+    args = _get_base_config(**kwargs)
+    model = BEiT3ForVisualReasoning(args, num_classes=2, **kwargs)
+    return model
+
+
+@register_model
+def beit3_large_patch16_224_nlvr2(pretrained=False, **kwargs):
+    args = _get_large_config(**kwargs)
+    model = BEiT3ForVisualReasoning(args, num_classes=2, **kwargs)
+    return model
+
+
+@register_model
+def beit3_base_patch16_384_vqav2(pretrained=False, **kwargs):
+    args = _get_base_config(img_size=384, **kwargs)
+    args.normalize_output = False
+    model = BEiT3ForVisualQuestionAnswering(args, num_classes=3129, **kwargs)
+    return model
+
+
+@register_model
+def beit3_base_patch16_480_vqav2(pretrained=False, **kwargs):
+    args = _get_base_config(img_size=480, **kwargs)
+    args.normalize_output = False
+    model = BEiT3ForVisualQuestionAnswering(args, num_classes=3129, **kwargs)
+    return model
+
+
+@register_model
+def beit3_large_patch16_384_vqav2(pretrained=False, **kwargs):
+    args = _get_large_config(img_size=384, **kwargs)
+    args.normalize_output = False
+    model = BEiT3ForVisualQuestionAnswering(args, num_classes=3129, **kwargs)
+    return model
+
+
+@register_model
+def beit3_large_patch16_480_vqav2(pretrained=False, **kwargs):
+    args = _get_large_config(img_size=480, **kwargs)
+    args.normalize_output = False
+    model = BEiT3ForVisualQuestionAnswering(args, num_classes=3129, **kwargs)
+    return model
+
+
+@register_model
+def beit3_large_patch16_768_vqav2(pretrained=False, **kwargs):
+    args = _get_large_config(img_size=768, **kwargs)
+    args.normalize_output = False
+    model = BEiT3ForVisualQuestionAnswering(args, num_classes=3129, **kwargs)
+    return model
+
+
+@register_model
+def beit3_base_patch16_224_captioning(pretrained=False, **kwargs):
+    args = _get_base_config(**kwargs)
+    model = BEiT3ForCaptioning(args, **kwargs)
+    return model
+
+
+@register_model
+def beit3_base_patch16_480_captioning(pretrained=False, **kwargs):
+    args = _get_base_config(img_size=480, **kwargs)
+    model = BEiT3ForCaptioning(args, **kwargs)
+    return model
+
+
+@register_model
+def beit3_large_patch16_480_captioning(pretrained=False, **kwargs):
+    args = _get_large_config(img_size=480, **kwargs)
+    model = BEiT3ForCaptioning(args, **kwargs)
+    return model
+
+
+@register_model
+def beit3_base_patch16_224_retrieval(pretrained=False, **kwargs):
+    args = _get_base_config(**kwargs)
+    model = BEiT3ForRetrieval(args, **kwargs)
+    return model
+
+
+@register_model
+def beit3_base_patch16_384_retrieval(pretrained=False, **kwargs):
+    args = _get_base_config(img_size=384, **kwargs)
+    model = BEiT3ForRetrieval(args, **kwargs)
+    return model
+
+
+@register_model
+def beit3_large_patch16_384_retrieval(pretrained=False, **kwargs):
+    args = _get_large_config(img_size=384, **kwargs)
+    model = BEiT3ForRetrieval(args, **kwargs)
+    return model
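
The attention mask that `BEiT3ForCaptioning.forward` builds (`uni_mask`) can be hard to picture from the slicing alone. The sketch below rebuilds the same layout with NumPy for a tiny case; it is an illustration, not the model code. The function name `build_captioning_mask` is hypothetical, and the reading that 1 means "blocked" after the final inversion is inferred from the comments in the source.

```python
import numpy as np


def build_captioning_mask(image_len: int, text_len: int) -> np.ndarray:
    """Rows are query positions, columns are key positions; image tokens come first."""
    max_len = image_len + text_len
    allowed = np.zeros((max_len, max_len), dtype=np.int64)
    # Full attention for image to image.
    allowed[:image_len, :image_len] = 1
    # Full attention for caption to image.
    allowed[image_len:, :image_len] = 1
    # Lower-triangular (causal) attention for caption to caption.
    allowed[image_len:, image_len:] = np.tril(np.ones((text_len, text_len), dtype=np.int64))
    # Image to caption stays 0, so image tokens never see the caption.
    # Invert as the source does: 0 = allowed, 1 = blocked.
    return 1 - allowed


print(build_captioning_mask(image_len=2, text_len=3))
```

With 2 image tokens and 3 caption tokens, each caption row unblocks one more caption column than the row above it, while the two image rows block every caption column.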
--- a/beit3/modeling_utils.py
+++ b/beit3/modeling_utils.py
+# --------------------------------------------------------
+# Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks (https://arxiv.org/abs/2208.10442)
+# Github source: https://github.com/microsoft/unilm/tree/master/beit3
+# Copyright (c) 2023 Microsoft
+# Licensed under The MIT License [see LICENSE for details]
+# --------------------------------------------------------
+
+import math
+import torch
+import torch.nn as nn
+from timm.models.layers import trunc_normal_ as __call_trunc_normal_
+
+from torchscale.model.BEiT3 import BEiT3
+from torchscale.architecture.config import EncoderConfig
+
+
+def trunc_normal_(tensor, mean=0., std=1.):
+    __call_trunc_normal_(tensor, mean=mean, std=std, a=-std, b=std)
+
+
+def _get_base_config(
+        img_size=224, patch_size=16, drop_path_rate=0, 
+        checkpoint_activations=None, mlp_ratio=4, vocab_size=64010, **kwargs
+):
+    return EncoderConfig(
+        img_size=img_size, patch_size=patch_size, vocab_size=vocab_size, multiway=True, 
+        layernorm_embedding=False, normalize_output=True, no_output_layer=True, 
+        drop_path_rate=drop_path_rate, encoder_embed_dim=768, encoder_attention_heads=12, 
+        encoder_ffn_embed_dim=int(768 * mlp_ratio), encoder_layers=12, 
+        checkpoint_activations=checkpoint_activations, 
+    )
+
+
+def _get_large_config(
+        img_size=224, patch_size=16, drop_path_rate=0, 
+        checkpoint_activations=None, mlp_ratio=4, vocab_size=64010, **kwargs
+):
+    return EncoderConfig(
+        img_size=img_size, patch_size=patch_size, vocab_size=vocab_size, multiway=True, 
+        layernorm_embedding=False, normalize_output=True, no_output_layer=True, 
+        drop_path_rate=drop_path_rate, encoder_embed_dim=1024, encoder_attention_heads=16, 
+        encoder_ffn_embed_dim=int(1024 * mlp_ratio), encoder_layers=24, 
+        checkpoint_activations=checkpoint_activations, 
+    )
+
+
+class BEiT3Wrapper(nn.Module):
+    def __init__(self, args, **kwargs):
+        super().__init__()
+        self.args = args
+        self.beit3 = BEiT3(args)
+        self.apply(self._init_weights)
+
+    def fix_init_weight(self):
+        def rescale(param, layer_id):
+            param.div_(math.sqrt(2.0 * layer_id))
+
+        for layer_id, layer in enumerate(self.blocks):
+            rescale(layer.attn.proj.weight.data, layer_id + 1)
+            rescale(layer.mlp.fc2.weight.data, layer_id + 1)
+
+    def get_num_layers(self):
+        return self.beit3.encoder.num_layers
+
+    @torch.jit.ignore
+    def no_weight_decay(self):
+        return {'pos_embed', 'cls_token', 'beit3.encoder.embed_positions.A.weight', 'beit3.vision_embed.cls_token', 'logit_scale'}
+
+    def _init_weights(self, m):
+        if isinstance(m, nn.Linear):
+            trunc_normal_(m.weight, std=.02)
+            if m.bias is not None:
+                nn.init.constant_(m.bias, 0)
+        elif isinstance(m, nn.LayerNorm):
+            nn.init.constant_(m.bias, 0)
+            nn.init.constant_(m.weight, 1.0)
--- a/beit3/optim_factory.py
+++ b/beit3/optim_factory.py
+# --------------------------------------------------------
+# Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks (https://arxiv.org/abs/2208.10442)
+# Github source: https://github.com/microsoft/unilm/tree/master/beit3
+# Copyright (c) 2023 Microsoft
+# Licensed under The MIT License [see LICENSE for details]
+# --------------------------------------------------------
+
+from torch import optim as optim
+from timm.optim.lookahead import Lookahead
+
+import json
+
+
+def get_num_layer_for_vit(var_name, num_max_layer):
+    if "embed" in var_name:
+        return 0
+    elif var_name in (
+        "cls_token", "mask_token", "pos_embed", "language_pos_embed", 
+        "word_embeddings.weight", "vision_cls_token", "vision_pos_embed"
+    ):
+        return 0
+    elif var_name.startswith("patch_embed"):
+        return 0
+    elif var_name.startswith("rel_pos_bias"):
+        return num_max_layer - 1
+    elif "layers." in var_name:
+        layer_id = int(var_name.split('layers.')[1].split('.')[0])
+        return layer_id + 1
+    else:
+        return num_max_layer - 1
+
+
+def get_is_head_flag_for_vit(var_name, num_max_layer):
+    if var_name.startswith("head"):
+        return 1
+    # elif var_name.startswith("pooler"):
+    #     return 1
+    else:
+        return 0
+
+
+class LayerDecayValueAssigner(object):
+    def __init__(self, values, scale_handler=None):
+        self.scale_handler = scale_handler or get_num_layer_for_vit
+        self.values = values
+
+    def get_scale(self, layer_id):
+        return self.values[layer_id]
+
+    def get_layer_id(self, var_name):
+        return self.scale_handler(var_name, len(self.values))
+
+
+# The implementation code is modified from Timm (https://github.com/huggingface/pytorch-image-models/tree/main/timm)
+def get_parameter_groups(model, weight_decay=1e-5, skip_list=(), get_num_layer=None, get_layer_scale=None):
+    parameter_group_names = {}
+    parameter_group_vars = {}
+
+    for name, param in model.named_parameters():
+        if not param.requires_grad:
+            continue  # frozen weights
+        if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
+            group_name = "no_decay"
+            this_weight_decay = 0.
+        else:
+            group_name = "decay"
+            this_weight_decay = weight_decay
+        if get_num_layer is not None:
+            layer_id = get_num_layer(name)
+            group_name = "layer_%d_%s" % (layer_id, group_name)
+        else:
+            layer_id = None
+
+        if group_name not in parameter_group_names:
+            if get_layer_scale is not None:
+                scale = get_layer_scale(layer_id)
+            else:
+                scale = 1.
+
+            parameter_group_names[group_name] = {
+                "weight_decay": this_weight_decay,
+                "params": [],
+                "lr_scale": scale
+            }
+            parameter_group_vars[group_name] = {
+                "weight_decay": this_weight_decay,
+                "params": [],
+                "lr_scale": scale
+            }
+
+        parameter_group_vars[group_name]["params"].append(param)
+        parameter_group_names[group_name]["params"].append(name)
+    print("Param groups = %s" % json.dumps(parameter_group_names, indent=2))
+    return list(parameter_group_vars.values())
+
+
+def create_optimizer(args, model, get_num_layer=None, get_layer_scale=None, filter_bias_and_bn=True, skip_list=None):
+    opt_lower = args.opt.lower()
+    weight_decay = args.weight_decay
+    if weight_decay and filter_bias_and_bn:
+        skip = set()
+        if skip_list is not None:
+            skip = skip_list
+        elif hasattr(model, 'no_weight_decay'):
+            skip = model.no_weight_decay()
+        parameters = get_parameter_groups(model, weight_decay, skip, get_num_layer, get_layer_scale)
+        weight_decay = 0.
+    else:
+        parameters = model.parameters()
+
+    opt_args = dict(lr=args.lr, weight_decay=weight_decay)
+    if hasattr(args, 'opt_eps') and args.opt_eps is not None:
+        opt_args['eps'] = args.opt_eps
+    if hasattr(args, 'opt_betas') and args.opt_betas is not None:
+        opt_args['betas'] = args.opt_betas
+
+    opt_split = opt_lower.split('_')
+    opt_lower = opt_split[-1]
+    if opt_lower == 'adamw':
+        optimizer = optim.AdamW(parameters, **opt_args)
+    else:
+        raise ValueError("Invalid optimizer")
+
+    if len(opt_split) > 1:
+        if opt_split[0] == 'lookahead':
+            optimizer = Lookahead(optimizer)
+
+    return optimizer
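
`LayerDecayValueAssigner` maps a parameter name to a layer id and then to a learning-rate scale taken from `values`. The sketch below shows how such a table could be produced and looked up. It rests on assumptions: the geometric schedule in `layer_decay_values` is a common choice but is not shown in this file, `layer_id_for` is a simplified stand-in for `get_num_layer_for_vit`, and the parameter names are made up for illustration.

```python
from typing import List


def layer_decay_values(num_layers: int, layer_decay: float) -> List[float]:
    # ASSUMPTION: geometric schedule with one slot for the embeddings (index 0),
    # one per encoder layer, and one for everything after the last layer.
    return [layer_decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]


def layer_id_for(var_name: str, num_max_layer: int) -> int:
    # Simplified stand-in for get_num_layer_for_vit in optim_factory.py.
    if "embed" in var_name:
        return 0
    if "layers." in var_name:
        return int(var_name.split("layers.")[1].split(".")[0]) + 1
    return num_max_layer - 1


values = layer_decay_values(num_layers=2, layer_decay=0.5)
# Hypothetical parameter names, chosen only to hit each branch.
for name in ("beit3.text_embed.weight", "beit3.encoder.layers.1.ffn.fc1.weight", "head.weight"):
    print(name, values[layer_id_for(name, len(values))])
```

Under this schedule, the embeddings get the smallest scale and parameters after the last encoder layer get a scale of 1.0. With `--layer_decay 1.0`, as in the large-model command, every scale is 1.0.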
--- a/beit3/randaug.py
+++ b/beit3/randaug.py
+import cv2
+import numpy as np
+
+
+## aug functions
+def identity_func(img):
+    return img
+
+
+def autocontrast_func(img, cutoff=0):
+    '''
+        same output as PIL.ImageOps.autocontrast
+    '''
+    n_bins = 256
+
+    def tune_channel(ch):
+        n = ch.size
+        cut = cutoff * n // 100
+        if cut == 0:
+            high, low = ch.max(), ch.min()
+        else:
+            hist = cv2.calcHist([ch], [0], None, [n_bins], [0, n_bins])
+            low = np.argwhere(np.cumsum(hist) > cut)
+            low = 0 if low.shape[0] == 0 else low[0]
+            high = np.argwhere(np.cumsum(hist[::-1]) > cut)
+            high = n_bins - 1 if high.shape[0] == 0 else n_bins - 1 - high[0]
+        if high <= low:
+            table = np.arange(n_bins)
+        else:
+            scale = (n_bins - 1) / (high - low)
+            offset = -low * scale
+            table = np.arange(n_bins) * scale + offset
+            table[table < 0] = 0
+            table[table > n_bins - 1] = n_bins - 1
+        table = table.clip(0, 255).astype(np.uint8)
+        return table[ch]
+
+    channels = [tune_channel(ch) for ch in cv2.split(img)]
+    out = cv2.merge(channels)
+    return out
+
+
+def equalize_func(img):
+    '''
+        same output as PIL.ImageOps.equalize
+        PIL's implementation is different from cv2.equalizeHist
+    '''
+    n_bins = 256
+
+    def tune_channel(ch):
+        hist = cv2.calcHist([ch], [0], None, [n_bins], [0, n_bins])
+        non_zero_hist = hist[hist != 0].reshape(-1)
+        step = np.sum(non_zero_hist[:-1]) // (n_bins - 1)
+        if step == 0: return ch
+        n = np.empty_like(hist)
+        n[0] = step // 2
+        n[1:] = hist[:-1]
+        table = (np.cumsum(n) // step).clip(0, 255).astype(np.uint8)
+        return table[ch]
+
+    channels = [tune_channel(ch) for ch in cv2.split(img)]
+    out = cv2.merge(channels)
+    return out
+
+
+def rotate_func(img, degree, fill=(0, 0, 0)):
+    '''
+    like PIL, rotate by degree, not radians
+    '''
+    H, W = img.shape[0], img.shape[1]
+    center = W / 2, H / 2
+    M = cv2.getRotationMatrix2D(center, degree, 1)
+    out = cv2.warpAffine(img, M, (W, H), borderValue=fill)
+    return out
+
+
+def solarize_func(img, thresh=128):
+    '''
+        same output as PIL.ImageOps.solarize
+    '''
+    table = np.array([el if el < thresh else 255 - el for el in range(256)])
+    table = table.clip(0, 255).astype(np.uint8)
+    out = table[img]
+    return out
+
+
+def color_func(img, factor):
+    '''
+        same output as PIL.ImageEnhance.Color
+    '''
+    ## implementation according to PIL definition, quite slow
+    #  degenerate = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)[:, :, np.newaxis]
+    #  out = blend(degenerate, img, factor)
+    #  M = (
+    #      np.eye(3) * factor
+    #      + np.float32([0.114, 0.587, 0.299]).reshape(3, 1) * (1. - factor)
+    #  )[np.newaxis, np.newaxis, :]
+    M = (
+            np.float32([
+                [0.886, -0.114, -0.114],
+                [-0.587, 0.413, -0.587],
+                [-0.299, -0.299, 0.701]]) * factor
+            + np.float32([[0.114], [0.587], [0.299]])
+    )
+    out = np.matmul(img, M).clip(0, 255).astype(np.uint8)
+    return out
+
+
+def contrast_func(img, factor):
+    """
+        same output as PIL.ImageEnhance.Contrast
+    """
+    mean = np.sum(np.mean(img, axis=(0, 1)) * np.array([0.114, 0.587, 0.299]))
+    table = np.array([(
+        el - mean) * factor + mean
+        for el in range(256)
+    ]).clip(0, 255).astype(np.uint8)
+    out = table[img]
+    return out
+
+
+def brightness_func(img, factor):
+    '''
+        same output as PIL.ImageEnhance.Brightness
+    '''
+    table = (np.arange(256, dtype=np.float32) * factor).clip(0, 255).astype(np.uint8)
+    out = table[img]
+    return out
+
+
+def sharpness_func(img, factor):
+    '''
+    The differences between this result and PIL's are all on the 4 boundaries; the center
+    areas are the same
+    '''
+    kernel = np.ones((3, 3), dtype=np.float32)
+    kernel[1][1] = 5
+    kernel /= 13
+    degenerate = cv2.filter2D(img, -1, kernel)
+    if factor == 0.0:
+        out = degenerate
+    elif factor == 1.0:
+        out = img
+    else:
+        out = img.astype(np.float32)
+        degenerate = degenerate.astype(np.float32)[1:-1, 1:-1, :]
+        out[1:-1, 1:-1, :] = degenerate + factor * (out[1:-1, 1:-1, :] - degenerate)
+        out = out.astype(np.uint8)
+    return out
+
+
+def shear_x_func(img, factor, fill=(0, 0, 0)):
+    H, W = img.shape[0], img.shape[1]
+    M = np.float32([[1, factor, 0], [0, 1, 0]])
+    out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
+    return out
+
+
+def translate_x_func(img, offset, fill=(0, 0, 0)):
+    '''
+        same output as PIL.Image.transform
+    '''
+    H, W = img.shape[0], img.shape[1]
+    M = np.float32([[1, 0, -offset], [0, 1, 0]])
+    out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
+    return out
+
+
+def translate_y_func(img, offset, fill=(0, 0, 0)):
+    '''
+        same output as PIL.Image.transform
+    '''
+    H, W = img.shape[0], img.shape[1]
+    M = np.float32([[1, 0, 0], [0, 1, -offset]])
+    out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
+    return out
+
+
+def posterize_func(img, bits):
+    '''
+        same output as PIL.ImageOps.posterize
+    '''
+    # keep only the low 8 bits of the shifted mask so it fits in uint8
+    out = np.bitwise_and(img, np.uint8((255 << (8 - bits)) & 0xFF))
+    return out
+
+
+def shear_y_func(img, factor, fill=(0, 0, 0)):
+    H, W = img.shape[0], img.shape[1]
+    M = np.float32([[1, 0, 0], [factor, 1, 0]])
+    out = cv2.warpAffine(img, M, (W, H), borderValue=fill, flags=cv2.INTER_LINEAR).astype(np.uint8)
+    return out
+
+
+def cutout_func(img, pad_size, replace=(0, 0, 0)):
+    replace = np.array(replace, dtype=np.uint8)
+    H, W = img.shape[0], img.shape[1]
+    rh, rw = np.random.random(2)
+    pad_size = pad_size // 2
+    ch, cw = int(rh * H), int(rw * W)
+    x1, x2 = max(ch - pad_size, 0), min(ch + pad_size, H)
+    y1, y2 = max(cw - pad_size, 0), min(cw + pad_size, W)
+    out = img.copy()
+    out[x1:x2, y1:y2, :] = replace
+    return out
+
+
+### level to args
+def enhance_level_to_args(MAX_LEVEL):
+    def level_to_args(level):
+        return ((level / MAX_LEVEL) * 1.8 + 0.1,)
+    return level_to_args
+
+
+def shear_level_to_args(MAX_LEVEL, replace_value):
+    def level_to_args(level):
+        level = (level / MAX_LEVEL) * 0.3
+        if np.random.random() > 0.5: level = -level
+        return (level, replace_value)
+
+    return level_to_args
+
+
+def translate_level_to_args(translate_const, MAX_LEVEL, replace_value):
+    def level_to_args(level):
+        level = (level / MAX_LEVEL) * float(translate_const)
+        if np.random.random() > 0.5: level = -level
+        return (level, replace_value)
+
+    return level_to_args
+
+
+def cutout_level_to_args(cutout_const, MAX_LEVEL, replace_value):
+    def level_to_args(level):
+        level = int((level / MAX_LEVEL) * cutout_const)
+        return (level, replace_value)
+
+    return level_to_args
+
+
+def solarize_level_to_args(MAX_LEVEL):
+    def level_to_args(level):
+        level = int((level / MAX_LEVEL) * 256)
+        return (level, )
+    return level_to_args
+
+
+def none_level_to_args(level):
+    return ()
+
+
+def posterize_level_to_args(MAX_LEVEL):
+    def level_to_args(level):
+        level = int((level / MAX_LEVEL) * 4)
+        return (level, )
+    return level_to_args
+
+
+def rotate_level_to_args(MAX_LEVEL, replace_value):
+    def level_to_args(level):
+        level = (level / MAX_LEVEL) * 30
+        if np.random.random() < 0.5:
+            level = -level
+        return (level, replace_value)
+
+    return level_to_args
+
+
+func_dict = {
+    'Identity': identity_func,
+    'AutoContrast': autocontrast_func,
+    'Equalize': equalize_func,
+    'Rotate': rotate_func,
+    'Solarize': solarize_func,
+    'Color': color_func,
+    'Contrast': contrast_func,
+    'Brightness': brightness_func,
+    'Sharpness': sharpness_func,
+    'ShearX': shear_x_func,
+    'TranslateX': translate_x_func,
+    'TranslateY': translate_y_func,
+    'Posterize': posterize_func,
+    'ShearY': shear_y_func,
+}
+
+translate_const = 10
+MAX_LEVEL = 10
+replace_value = (128, 128, 128)
+arg_dict = {
+    'Identity': none_level_to_args,
+    'AutoContrast': none_level_to_args,
+    'Equalize': none_level_to_args,
+    'Rotate': rotate_level_to_args(MAX_LEVEL, replace_value),
+    'Solarize': solarize_level_to_args(MAX_LEVEL),
+    'Color': enhance_level_to_args(MAX_LEVEL),
+    'Contrast': enhance_level_to_args(MAX_LEVEL),
+    'Brightness': enhance_level_to_args(MAX_LEVEL),
+    'Sharpness': enhance_level_to_args(MAX_LEVEL),
+    'ShearX': shear_level_to_args(MAX_LEVEL, replace_value),
+    'TranslateX': translate_level_to_args(
+        translate_const, MAX_LEVEL, replace_value
+    ),
+    'TranslateY': translate_level_to_args(
+        translate_const, MAX_LEVEL, replace_value
+    ),
+    'Posterize': posterize_level_to_args(MAX_LEVEL),
+    'ShearY': shear_level_to_args(MAX_LEVEL, replace_value),
+}
+
+
+class RandomAugment(object):
+
+    def __init__(self, N=2, M=10, isPIL=False, augs=None):
+        self.N = N
+        self.M = M
+        self.isPIL = isPIL
+        if augs:
+            self.augs = augs       
+        else:
+            self.augs = list(arg_dict.keys())
+
+    def get_random_ops(self):
+        sampled_ops = np.random.choice(self.augs, self.N)
+        return [(op, 0.5, self.M) for op in sampled_ops]
+
+    def __call__(self, img):
+        if self.isPIL:
+            img = np.array(img)            
+        ops = self.get_random_ops()
+        for name, prob, level in ops:
+            if np.random.random() > prob:
+                continue
+            args = arg_dict[name](level)
+            img = func_dict[name](img, *args) 
+        return img
+
+
+if __name__ == '__main__':
+    a = RandomAugment()
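+    # the lookup-table ops index with pixel values, so the test image must be uint8
+    img = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)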
+    a(img)
--- a/beit3/requirements.txt
+++ b/beit3/requirements.txt
+timm==0.4.12
+Pillow
+blobfile
+mypy
+numpy
+pytest
+requests
+einops
+tensorboardX
+scipy
+ftfy
+opencv-python
+sentencepiece
+pyarrow
+torchmetrics==0.7.3
+transformers
+pycocotools
+pycocoevalcap
+torchscale==0.2.0
--- a/beit3/run_beit3_finetuning.py
+++ b/beit3/run_beit3_finetuning.py
--- a/beit3/train.sh
+++ b/beit3/train.sh
+#!/bin/bash
+
+export HIP_VISIBLE_DEVICES=0,1,2,3 # change to the IDs and number of the devices used for training
+export HSA_FORCE_FINE_GRAIN_PCIE=1
+export USE_MIOPEN_BATCHNORM=1
+
+python -m torch.distributed.launch --nproc_per_node=4 run_beit3_finetuning.py \
+        --model beit3_base_patch16_480 \
+        --input_size 480 \
+        --task coco_captioning \
+        --batch_size 32 \
+        --layer_decay 1.0 \
+        --lr 4e-5 \
+        --randaug \
+        --epochs 10 \
+        --warmup_epochs 1 \
+        --drop_path 0.1 \
+        --sentencepiece_model ./pretrained_models/beit3.spm \
+        --finetune ./pretrained_models/beit3_base_patch16_224.pth \
+        --data_path /home/data/coco2014 \
+        --output_dir ./save_models/ \
+        --log_dir ./logs \
+        --weight_decay 0.05 \
+        --seed 42 \
+        --save_ckpt_freq 5 \
+        --num_max_bpe_tokens 32 \
+        --captioning_mask_prob 0.7 \
+        --drop_worst_after 12000 \
+        --dist_eval \
+        --checkpoint_activations \
+        --enable_deepspeed
\ No newline at end of file
--- a/beit3/utils.py
+++ b/beit3/utils.py
--- a/beit3/val.sh
+++ b/beit3/val.sh
+#!/bin/bash
+
+export HSA_FORCE_FINE_GRAIN_PCIE=1
+export USE_MIOPEN_BATCHNORM=1
+
+python -m torch.distributed.launch --nproc_per_node=4 run_beit3_finetuning.py \
+        --model beit3_base_patch16_480 \
+        --input_size 480 \
+        --task coco_captioning \
+        --batch_size 16 \
+        --sentencepiece_model ./pretrained_models/beit3.spm \
+        --finetune ./pretrained_models/beit3_base_patch16_480_coco_captioning.pth \
+        --data_path ../../data/coco2014/ \
+        --output_dir ./save_models \
+        --eval \
+        --dist_eval