.idea/*
.DS_Store
.vscode
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 1999-2022 Alibaba Group Holding Ltd.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# OFA
The principles and steps in this project apply to the Image Captioning task in OFA; the other algorithms in the OFA project are used in a similar way.
## Paper
`OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework`
- https://arxiv.org/pdf/2202.03052.pdf
## Model Structure
The image is first split into patches with a convolution to reduce the amount of computation, and each patch is flattened into a sequence. The image sequence and the text (NLP) sequence are then fed into the encoder together, and the encoder output is combined with the target in the decoder to extract features and produce the prediction. The overall architecture is an encoder-decoder.
<div align=center>
<img src="./doc/structure.png"/>
</div>
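As a minimal sketch of the patchify-and-flatten step described above (the patch size and hidden size here are placeholder values, and OFA's actual image backbone differs in detail):
```python
import torch
import torch.nn as nn

# A convolution whose stride equals its kernel size splits the image into
# non-overlapping patches; each patch becomes one token of the visual sequence.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

images = torch.randn(2, 3, 224, 224)                 # (batch, channels, H, W)
patches = patch_embed(images)                        # (2, 768, 14, 14)
visual_tokens = patches.flatten(2).transpose(1, 2)   # (2, 196, 768): sequence fed to the encoder
```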
## Algorithm
Borrowing the Transformer architecture from the paper "Attention Is All You Need", features are extracted with attention modules. The core idea of OFA is to encode text, images, and detection targets as sequences over a single unified vocabulary, so that one model architecture can be used for training and inference across tasks, which makes the model far more general.
<div align=center>
<img src="./doc/theory.png"/>
</div>
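The sketch below illustrates the unified-vocabulary idea with made-up sizes: text subwords, discrete image codes, and quantized location bins are concatenated into one token space, so any modality can be expressed as a token sequence (the numbers are placeholders, not OFA's real configuration):
```python
# Hypothetical sizes for illustration only.
NUM_TEXT_TOKENS = 50000   # BPE subwords for text
NUM_IMAGE_CODES = 8192    # discrete codes from an image tokenizer (e.g. a VQ model)
NUM_LOCATION_BINS = 1000  # quantized bounding-box coordinates for detection targets

text_tokens = [f"<txt_{i}>" for i in range(NUM_TEXT_TOKENS)]
image_codes = [f"<code_{i}>" for i in range(NUM_IMAGE_CODES)]
location_bins = [f"<bin_{i}>" for i in range(NUM_LOCATION_BINS)]

# One shared vocabulary: a caption, an image, or a bounding box all become
# index sequences over the same dictionary, handled by the same encoder-decoder.
unified_vocab = text_tokens + image_codes + location_bins
print(len(unified_vocab))  # 59192
```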
## Environment Setup
```
mv OFA_pytorch OFA # drop the framework suffix from the directory name
mkdir -p OFA/checkpoints
../../checkpoints/ofa_large.pt # before finetuning, download the pretrained weights ofa_large.pt into the checkpoints folder (link below)
```
- https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_large.pt
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest
# replace <your IMAGE ID> below with the ID of the docker image pulled above
docker run --shm-size=32G --name=ofa --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/OFA:/home/OFA -it <your IMAGE ID> bash
pip install -r requirements.txt
cp -r OFA/nltk_data /root/ # place the .zip archives that the nltk library needs to load
```
### Dockerfile (Option 2)
```
cd OFA/docker
docker build --no-cache -t ofa:latest .
docker run --shm-size=32G --name=ofa --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../OFA:/home/OFA -it ofa bash
# if installing the environment via the Dockerfile takes too long, comment out the pip install inside it and install the Python libraries after the container starts: pip install -r requirements.txt
cp -r OFA/nltk_data /root/ # place the .zip archives that the nltk library needs to load
cd OFA && pip install -e ./fairseq/
```
### Anaconda (Option 3)
1. The special deep learning libraries required by the DCU cards used in this project can be downloaded and installed from the 光合 developer community:
- https://developer.hpccube.com/tool/
```
DTK driver: dtk23.04
python: 3.8
torch: 1.13.1
torchvision: 0.14.1
torchaudio: 0.13.1
```
`Tips: the DTK driver, python, torch, and other DCU-related tool versions above must correspond to each other exactly; for fairseq, only the modified version (v1.0.0) bundled with this project by the original authors can be used.`
2. Install the other, non-special libraries according to requirements.txt
```
pip install -r requirements.txt
cp -r OFA/nltk_data /root/ # place the .zip archives that the nltk library needs to load
```
## Datasets
The data used by OFA comes from a large number of public datasets:
<div align=center>
<img src="./doc/datasets.png"/>
</div>
This project mainly uses `COCO`:
- https://cocodataset.org/#download
Although all datasets used in this project come from public sources, the original authors customized them to fit the algorithm. Training and inference require downloading the dataset below; the data-processing code has not yet been open-sourced, but it will be released in the future.
- https://ofa-beijing.oss-cn-beijing.aliyuncs.com/datasets/caption_data/caption_data.zip
The training data directory structure is shown below; extract caption_data.zip into this directory and training will run normally:
```
OFA/
├── dataset/
│ ├── caption_data/
│ │ ├── caption_stage1_train.tsv
│ │ ├── caption_stage2_train.tsv
│ │ ├── caption_test.tsv
│ │ ├── caption_val.tsv
│ │ ├── test_caption_coco_format.json
│ │ └── cider_cached_tokens/
│ │ ├── coco-test-words.p
│ │ ├── coco-train-words.p
│ │ └── coco-valid-words.p
│ │
│ └── xxx_data/
```
`For more details, see README_origin.md from the original project.`
## Training
### Single-node multi-card
```
cd OFA/run_scripts/caption
nohup sh train_caption_stage1.sh > train_stage1.out & # stage 1, train with cross-entropy loss
cp stage1_checkpoints/2_0.06_2500/checkpoint_best.pt ../../checkpoints/caption_stage1_best.pt
nohup sh train_caption_stage2.sh > train_stage2.out & # stage 2, load the best ckpt of stage1 and train with CIDEr optimization
```
## Inference
The fairseq version installed above cannot run inference successfully, so fairseq must be reinstalled here; note that the official open-source fairseq code on GitHub may also fail to install. We recommend installing it as follows:
```
pip install fairseq==0.12.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
```
cp stage2_checkpoints/1e-5_3/checkpoint_best.pt ../../checkpoints/caption_large_best_clean.pt
cd ../../
python caption_infer.py # from the Image Captioning section of colab.md
```
## Results
Input image:
<div align=center>
<img src="./doc/test.png"/>
</div>
Output caption:
```
a row of houses on a street.
```
### Accuracy
Test data: the test split in [caption_data](./dataset/caption_data/caption_test.tsv); inference framework: PyTorch.
| device | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | ROUGE_L | CIDEr | SPICE |
|:--------:| :------: | :------: | :------: |:------: | :------: | :------: | :------: |:------: |
| DCU Z100 | 0.836 | 0.694 | 0.555 | 0.434 | 0.320 | 0.622 | 1.484 | 0.259 |
| GPU A800 | 0.840 | 0.697 | 0.556 | 0.434 | 0.319 | 0.622 | 1.488 | 0.258 |
## Application Scenarios
### Algorithm Category
`Image understanding`
### Key Application Industries
`Retail, media & advertising, manufacturing, home furnishing, government`
## Pretrained Weights
Download the authors' open-source large pretrained weights from the Pretraining section of OFA/checkpoints.md:
- https://github.com/OFA-Sys/OFA/blob/main/checkpoints.md
## Source Repository & Issue Reporting
- https://developer.hpccube.com/codes/modelzoo/ofa_pytorch
## References
- https://github.com/OFA-Sys/OFA
# Finetuning with Encouraging Loss (EL)
Below we provide methods for finetuning with label smoothed encouraging loss proposed in [_Well-classified Examples are Underestimated in Classification with Deep Neural Networks_](https://arxiv.org/pdf/2110.06537.pdf) on different downstream tasks.
The implementation is in [label_smoothed_encouraging_loss.py](criterions/label_smoothed_encouraging_loss.py).
You can set the `--criterion` to `adjust_label_smoothed_encouraging_loss` to use it. This criterion has a hyper-parameter `--log-end`.
`--log-end < 1` results in an approximated and conservative version of the full encouraging loss.
A higher log_end more strongly counteracts gradient vanishing, strengthens the modeling of the data, and increases the growth rate of the margin, but it also produces a larger gradient norm, which challenges the existing optimization setup.
We recommend a higher log_end for cases with higher performance, and 0.75 or 0.5 as a first try.
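For intuition, the snippet below is a simplified sketch of the encouraging-loss bonus controlled by `log_end`; it ignores label smoothing and the other details of the repo's `adjust_label_smoothed_encouraging_loss` criterion:
```python
import math
import torch

def encouraging_loss(lprobs, target, log_end=0.75):
    """-log(p) plus a bonus log(1 - p) that keeps the gradient from vanishing on
    well-classified tokens; for p > log_end the bonus is continued by its tangent
    line at log_end, which bounds the extra gradient (the conservative version)."""
    nll = -lprobs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)  # -log p(target)
    p = (-nll).exp()                                                      # p(target)
    bonus = torch.log(torch.clamp(1.0 - p, min=1e-5))                     # log(1 - p)
    if log_end < 1.0:
        tangent = math.log(1.0 - log_end) - (p - log_end) / (1.0 - log_end)
        bonus = torch.where(p > log_end, tangent, bonus)
    return (nll + bonus).sum()
```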
## Image Captioning
We provide procedures for image captioning with EL below. The preprocessing is identical to the default setting.
<details>
<summary><b>Finetuning</b></summary>
<p>
We provide two scripts for stage 1.
</p>
<pre>
cd run_scripts/caption
nohup sh train_caption_stage1_el.sh > train_stage1_el.out & # stage 1, train with encouraging loss, expected cider 1.403
nohup sh train_caption_stage1_el_db.sh > train_stage1_el_db.out &  # stage 1, train with encouraging loss, and drop best examples, expected cider 1.404
</pre>
</details>
## Referring Expression Comprehension
We provide procedures for referring expression comprehension with EL below. The preprocessing is identical to the default setting.
<details>
<summary><b>Finetuning</b></summary>
<pre>
cd run_scripts/refcoco
nohup sh train_refcoco_el.sh > train_refcoco_el.out & # finetune for refcoco
nohup sh train_refcocoplus_el.sh > train_refcocoplus_el.out & # finetune for refcoco+
nohup sh train_refcocog_el.sh > train_refcocog_el.out & # finetune for refcocog
</pre>
</details>
Evaluation is also the same as the default setting.
# MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for speech recognition
<p align="center">
<a href="modelscope.md">ModelScope</a>&nbsp | &nbsp<a href="https://arxiv.org/abs/2212.00500">Paper </a>&nbsp
</p>
We propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs a multi-task learning framework with five self-supervised and supervised tasks over speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a relative improvement of more than 40% compared with other pre-training methods.
<p align="center">
<br>
<img src="examples/mmspeech.png" width="700" />
<br>
</p>
<br>
## Datasets & Checkpoints
| Model | Model Size | Unlabeled Speech | Unlabeled Text | Labeled Data | Pre-Training | Fine-Tuning |
|:---------------|:----------:|:--------------------------------------------------:|:---------------------------------------------:|:----------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------:|
| MMSpeech-Base1 | 210M | [AISHELL-2](https://www.aishelltech.com/aishell_2) | [M6-Corpus](https://arxiv.org/abs/2103.00823) | [AISHELL-1](http://www.openslr.org/33/) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base1_pretrain.pt) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base1_aishell1.pt) |
| MMSpeech-Base2 | 210M | [WenetSpeech](https://wenet.org.cn/WenetSpeech/) | M6-Corpus | AISHELL-1 | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base2_pretrain.pt) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base2_aishell1.pt) |
| MMSpeech-Large | 609M | WenetSpeech | M6-Corpus | AISHELL-1 | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_large_pretrain.pt) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_large_aishell1.pt) |
## Results on AISHELL-1
- Comparison of MMSpeech-Base1 with models that use the same encoder size and the same amount of unlabeled speech data.
| Model | dev (w/o LM) | dev (with LM) | test (w/o LM) | test (with LM) |
|:---------------------------------|:------------:|:------------:|:-------------:|:--------------:|
| w/o pre-training | 6.4 | 5.2 | 6.8 | 5.7 |
| Data2Vec | 3.8 | 3.7 | 4.1 | 3.9 |
| MMSpeech-Base1 | 2.4 | 2.1 | 2.6 | 2.3 |
| MMSpeech-Base1 (w/o Fine-Tuning) | 2.5 | 2.3 | 2.6 | 2.3 |
- Comparison of MMSpeech-Base2 with models that use the same encoder size and the same amount of unlabeled speech data.
| Model | dev (with LM) | test (with LM) |
|:-----------------|:------------:|:--------------:|
| Wav2vec 2.0-Base | 4.2 | 4.7 |
| HuBERT-Base | 4.1 | 4.3 |
| MMSpeech-Base2 | 2.0 | 2.1 |
- Comparison of MMSpeech-Large with models that use the same encoder size and the same amount of unlabeled speech data.
| Model | dev (with LM) | test (with LM) |
|:------------------|:------------:|:--------------:|
| Wav2vec 2.0-Large | 3.8 | 4.1 |
| HuBERT-Large | 3.1 | 3.3 |
| MMSpeech-Large | 1.6 | 1.9 |
## Quick start
### Installation
Note that we have updated the fairseq version for MMSpeech.
```bash
git clone https://github.com/OFA-Sys/OFA
cd OFA
pip install -r requirements.txt
```
### Data preparation
Input files for all tasks contain three columns, "speech_id, wav_path, text", delimited by a tab ("\t").
- "wav_path" denotes the path to the wav file.
- "text" denotes the raw text input.
- "pseudo-codes" can be obtained by following the steps in [wav2seq](https://github.com/asappresearch/wav2seq).
| Data | Task | speech_id_col | wav_path_col | text_col |
|:----------------------|:--------:|:-------------:|:------------:|:------------:|
| unlabeled speech data | S2C, MSP | speech_id | wav_path | pseudo-codes |
| unlabeled text data | P2T | speech_id | unused | text |
| speech-text data | S2T | speech_id | wav_path | text |
We also provide an example config_yaml of input fbank features for your reference [here](http://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/data/fbank_config.yaml).
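As an illustration of the three-column manifest format above (the file name, paths, and transcripts below are made up for the example):
```python
import csv

# Write a tab-separated manifest with the "speech_id, wav_path, text" columns.
rows = [
    ("spk001_utt0001", "/data/aishell1/wav/spk001_utt0001.wav", "今天天气怎么样"),
    ("spk001_utt0002", "/data/aishell1/wav/spk001_utt0002.wav", "请打开车窗"),
]
with open("train_s2t.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    for speech_id, wav_path, text in rows:
        writer.writerow([speech_id, wav_path, text])
```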
### Training
```commandline
cd run_scripts/mmspeech
sh mmspeech_cn_base_stage1.sh
sh mmspeech_cn_base_stage2.sh
sh mmspeech_cn_base_stage3.sh
```
### Evaluation
```commandline
cd run_scripts/mmspeech
sh evaluate_mmspeech_base.sh
```
# Checkpoints
We provide links for you to download our checkpoints, including pretrained and finetuned models on different tasks. If you would like to use OFA with Transformers, please download checkpoints at [https://huggingface.co/OFA-Sys](https://huggingface.co/OFA-Sys), and check the code in the branch `feature/add_transformers`.
## Pretraining
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_huge.pt"> Pre-trained checkpoint (OFA-Huge) </a> (~930M parameters)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_large.pt"> Pre-trained checkpoint (OFA-Large) </a> (~470M parameters)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_base.pt"> Pre-trained checkpoint (OFA-Base) </a> (~180M parameters)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_medium.pt"> Pre-trained checkpoint (OFA-Medium) </a> (~93M parameters)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_tiny.pt"> Pre-trained checkpoint (OFA-Tiny) </a> (~33M parameters)
## Finetuning (OFA-Huge)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_huge_best.pt"> Finetuned checkpoint for Caption on COCO </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/vqa_huge_best.pt"> Finetuned checkpoint for VQAv2 </a>
## Finetuning (OFA-Large)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_large_best_clean.pt"> Finetuned checkpoint for Caption on COCO </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_stage1_best.pt"> Finetuned checkpoint for Caption on COCO During Stage1 Finetuning </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcoco_large_best.pt"> Finetuned checkpoint for RefCOCO </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocoplus_large_best.pt"> Finetuned checkpoint for RefCOCO+ </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocog_large_best.pt"> Finetuned checkpoint for RefCOCOg </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/vqa_large_best.pt"> Finetuned checkpoint for VQAv2 </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/snli_ve_large_best.pt"> Finetuned checkpoint for SNLI-VE </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/image_gen_large_best.zip"> Finetuned checkpoint for Text-to-Image Generation on COCO && CLIP checkpoint && VQGAN checkpoint </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/imagenet_1k_large_best.pt"> Finetuned checkpoint for ImageNet-1K </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/gigaword_large_best.pt"> Finetuned checkpoint for Gigaword </a>
## Finetuning (OFA-Base)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_base_best.pt"> Finetuned base checkpoint for Caption on COCO </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcoco_base_best.pt"> Finetuned base checkpoint for RefCOCO </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocoplus_base_best.pt"> Finetuned base checkpoint for RefCOCO+ </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocog_base_best.pt"> Finetuned base checkpoint for RefCOCOg </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/vqa_base_best.pt"> Finetuned base checkpoint for VQAv2 </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/snli_ve_base_best.pt"> Finetuned base checkpoint for SNLI-VE </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/image_gen_base_best.pt"> Finetuned base checkpoint for Text-to-Image Generation on COCO </a>
## Pretrained Language Models
To follow our multimodal pretraining, we suggest using pretrained language models for initialization. Note that for the base-size and large-size models, we directly use BART-base and BART-large, and for the other sizes, we pretrained tiny-size, medium-size, and huge-size OFA-based language models.
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_tiny_plaintext.pt"> Tiny-size encoder-decoder language model (OFA) </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_medium_plaintext.pt"> Medium-size encoder-decoder language model (OFA) </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/bart_base.pt"> Base-size encoder-decoder language model (BART) </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/bart_large.pt"> Large-size encoder-decoder language model (BART) </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_huge_plaintext.pt"> Huge-size encoder-decoder language model (OFA) </a>
# Checkpoints (OFA-CN)
We provide checkpoints of OFA-CN, which is the Chinese version of OFA. We provide Base-size and Large-size models, including pretrained and finetuned models on image captioning and referring expression comprehension. Note that we translated the texts in the RefCOCO(-/+/g) datasets and finetuned OFA-CN on them. We plan to release the related new datasets in the near future.
<br>
## Checkpoints
Below we provide the links for downloading the Chinese OFA checkpoints.
### Pretraining
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_cn_large.pt"> Pretrained checkpoint (OFA-CN-Large) </a> (~443M parameters)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_cn_base.pt "> Pretrained checkpoint (OFA-CN-Base) </a> (~160M parameters)
### Finetuning (OFA-Large)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_cn_large.pt"> Finetuned checkpoint for MUGE Caption (Stage 1) </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcoco_cn_large.pt"> Finetuned checkpoint for RefCOCO-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocoplus_cn_large.pt"> Finetuned checkpoint for RefCOCO+-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocog_cn_large.pt"> Finetuned checkpoint for RefCOCOg-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_cn_ocr_large.pt"> Finetuned checkpoint for Chinese OCR (multitask finetuned)</a>
### Finetuning (OFA-Base)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_cn_base.pt"> Finetuned checkpoint for MUGE Caption (Stage 1) </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcoco_cn_base.pt"> Finetuned checkpoint for RefCOCO-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocoplus_cn_base.pt"> Finetuned checkpoint for RefCOCO+-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocog_cn_base.pt"> Finetuned checkpoint for RefCOCOg-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_cn_ocr_base.pt"> Finetuned checkpoint for Chinese OCR (multitask finetuned) </a>
<br>
## Model Card
Below we provide the basic information of the base-size and large-size OFA-CN.
<table border="1" width="100%">
<tr align="center">
<th>Model</th><th>#Params</th><th>Backbone</th><th>Hidden Size</th><th>Intermediate Size</th><th>#Heads</th><th>#Enc. Layers</th><th>#Dec. Layers</th>
</tr>
<tr align="center">
<td>OFA<sub>Base</sub></td><td>160M</td><td>ResNet101</td><td>768</td><td>3072</td><td>12</td><td>6</td><td>6</td>
</tr>
<tr align="center">
<td>OFA<sub>Large</sub></td><td>443M</td><td>ResNet152</td><td>1024</td><td>4096</td><td>16</td><td>12</td><td>12</td>
</tr>
</table>
<br>
## Results
Below we provide the results of OFA-CN and the baselines for comparison.
### [MUGE Caption](https://tianchi.aliyun.com/muge)
<table border="1" width="100%">
<tr align="center">
<td>Model</td><td>BLEU@4</td><td>ROUGE-L</td><td>CIDEr-D</td>
</tr>
<tr align="center">
<td>Trm </td><td>7.33</td><td>51.51</td><td>11.00</td>
</tr>
<tr align="center">
<td>M6</td><td>16.19</td><td>55.06</td><td>30.75</td>
</tr>
<tr align="center">
<td>OFA<sub>Base</sub></td><td>26.23</td><td>58.95</td><td>50.70</td>
</tr>
<tr align="center">
<td>OFA<sub>Large</sub></td><td><b>27.32</b></td><td><b>59.20</b></td><td><b>53.51</b></td>
</tr>
</table>
### RefCOCO-CN Series
<table border="1" width="100%">
<tr align="center">
<td>Model</td><td>RefCOCO(val/testA/testB)</td><td>RefCOCO+(val/testA/testB)</td><td>RefCOCOg(val/test-u)</td>
</tr>
<tr align="center">
<td>OFA<sub>Base</sub>(random-init)</td><td>30.13/35.07/25.03</td><td>17.89/20.90/15.83</td><td>20.30/20.45</td>
</tr>
<tr align="center">
<td>OFA<sub>Base</sub></td><td>82.18/86.07/<b>76.68</b></td><td>69.38/77.26/60.14</td><td><b>73.57/72.53</b></td>
</tr>
<tr align="center">
<td>OFA<sub>Large</sub></td><td><b>82.84/86.54</b>/76.50</td><td><b>71.30/78.56/61.85</b></td><td>71.96/71.30</td>
</tr>
</table>
<br>
# Colab Notebooks
We provide Colab notebooks for different downstream tasks so you can try out OFA. See below.
* [Image Captioning in Huggingface Transformers](https://colab.research.google.com/drive/1Ho81RBV8jysZ7e0FhsSCk_v938QeDuy3?usp=sharing)
* [Generic Interface](https://colab.research.google.com/drive/1jogyZ-2rdHU3XxZOf3TBfhex1XHqX-1m?usp=sharing#scrollTo=s9Vni6YUZOpC) (using different instructions to perform various tasks with just one model.)
* [Image Captioning](https://colab.research.google.com/drive/1Q4eNhhhLcgOP4hHqwZwU1ijOlabgve1W?usp=sharing)
* [Referring Expression Comprehension](https://colab.research.google.com/drive/1AHQNRdaUpRTgr3XySHSlba8aXwBAjwPB?usp=sharing)
* [Open-Domain Visual Question Answering](https://colab.research.google.com/drive/1lsMsF-Vum3MVyXwSVF5E-Y23rHFvj_3y?usp=sharing)
from .scst_loss import ScstRewardCriterion
from .label_smoothed_cross_entropy import AdjustLabelSmoothedCrossEntropyCriterion
from .clip_scst_loss import ClipScstRewardCriterion
from .label_smoothed_encouraging_loss import AdjustLabelSmoothedEncouragingLossCriterion
from .speech_pretrain_loss import SpeechPretrainLoss
# Copyright 2022 The OFA-Sys Team.
# All rights reserved.
# This source code is licensed under the Apache 2.0 license
# found in the LICENSE file in the root directory.
import math
from dataclasses import dataclass, field
from typing import Optional
from PIL import Image
from torchvision import transforms
import torch
import numpy as np
from fairseq import metrics
from fairseq.data import data_utils
from fairseq.criterions import FairseqCriterion, register_criterion
from fairseq.dataclass import FairseqDataclass
from fairseq import utils
from omegaconf import II
from models import clip
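# Convert a generated image tensor in [-1, 1] with shape (C, H, W) back to an RGB PIL image.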
def custom_to_pil(x):
x = x.detach().cpu()
x = torch.clamp(x, -1., 1.)
x = (x + 1.) / 2.
x = x.permute(1, 2, 0).numpy()
x = (255 * x).astype(np.uint8)
x = Image.fromarray(x)
if not x.mode == "RGB":
x = x.convert("RGB")
return x
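# SCST (self-critical sequence training) policy-gradient loss: the negative log-probability of each
# sampled token is scaled by its sequence-level reward, with padding positions masked out.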
def scst_loss(lprobs, target, reward, ignore_index=None, reduce=True):
loss = -lprobs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze() * reward.unsqueeze(-1)
if ignore_index is not None:
pad_mask = target.eq(ignore_index)
loss.masked_fill_(pad_mask, 0.0)
ntokens = (~pad_mask).sum()
else:
loss = loss.squeeze(-1)
ntokens = target.numel()
if reduce:
loss = loss.sum()
return loss, ntokens
@dataclass
class ClipScstRewardCriterionConfig(FairseqDataclass):
ignore_prefix_size: int = field(
default=0,
metadata={"help": "Ignore first N tokens"},
)
sentence_avg: bool = II("optimization.sentence_avg")
constraint_range: Optional[str] = field(
default=None,
metadata={"help": "constraint range"}
)
@register_criterion(
"clip_scst_reward_criterion", dataclass=ClipScstRewardCriterionConfig
)
class ClipScstRewardCriterion(FairseqCriterion):
CLIP_REWARD_WEIGHT = 2.5
def __init__(
self,
task,
sentence_avg,
ignore_prefix_size=0,
constraint_range=None
):
super().__init__(task)
self.sentence_avg = sentence_avg
self.ignore_prefix_size = ignore_prefix_size
self.constraint_start = None
self.constraint_end = None
if constraint_range is not None:
constraint_start, constraint_end = constraint_range.split(',')
self.constraint_start = int(constraint_start)
self.constraint_end = int(constraint_end)
def forward(self, model, sample, update_num=0, reduce=True):
"""Compute the loss for the given sample.
Returns a tuple with three elements:
1) the loss
2) the sample size, which is used as the denominator for the gradient
3) logging outputs to display while training
"""
loss, score, ntokens, nsentences = self.compute_loss(model, sample, reduce=reduce)
sample_size = (
nsentences if self.sentence_avg else ntokens
)
logging_output = {
"loss": loss.data,
"score": score,
"ntokens": ntokens,
"nsentences": nsentences,
"sample_size": sample_size,
}
return loss, sample_size, logging_output
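# Score each generated image by its CLIP image-text similarity to the input caption, scaled by CLIP_REWARD_WEIGHT.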
def _calculate_clip_scores(self, gen_res, gt_text, device):
'''
gen_res: generated images, list of Image
gt_text: input captions.
device: device for clip model
'''
batch_size = len(gt_text)
gen_res_size = len(gen_res)
img_per_seq = gen_res_size // batch_size
hyp_images = torch.stack(
[self.task.clip_preprocess(gen_image) for gen_image in gen_res], dim=0
).to(device)
clip_input = clip.tokenize([text for text in gt_text]).to(device)
with torch.no_grad():
image_features = self.task.clip_model.encode_image(hyp_images)
text_features = self.task.clip_model.encode_text(clip_input)
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
image_features = image_features.view(batch_size, img_per_seq, -1)
text_features = text_features.view(batch_size, 1, -1)
ti_similarity = image_features @ text_features.transpose(1, 2)
ti_similarity = ti_similarity.view(-1)
scores = self.CLIP_REWARD_WEIGHT * ti_similarity
return scores
def get_generator_out(self, model, sample):
model.eval()
with torch.no_grad():
self.task.scst_generator.model.eval()
gen_out = self.task.scst_generator.generate([model], sample)
gen_target = []
gen_res = []
gt_text = []
for i in range(len(gen_out)):
with torch.no_grad():
tokens = torch.stack([item['tokens'][:-1] for item in gen_out[i]], dim=0)
tokens += -len(self.task.src_dict) + self.task.cfg.code_dict_size + self.task.cfg.num_bins
images = self.task.image_tokenizer.decode_code(
tokens.view(-1, self.task.cfg.code_image_size // 8, self.task.cfg.code_image_size // 8)
)
images = [custom_to_pil(image) for image in images]
gen_target += [item['tokens'] for item in gen_out[i]]
gen_res += images
gt_text.append(
self.task.bpe.decode(
self.task.tgt_dict.string(
utils.strip_pad(sample['net_input']['src_tokens'][i], self.padding_idx).cpu().int()
)
)[38:] # remove task instruction.
)
return gen_target, gen_res, gt_text
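# Leave-one-out baseline: each image's reward is its CLIP score minus the mean score of the other images generated for the same caption.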
def get_reward_and_scores(self, gen_res, gt_text, device):
batch_size = len(gt_text)
gen_res_size = len(gen_res)
img_per_sample = gen_res_size // batch_size
scores = self._calculate_clip_scores(gen_res, gt_text, device)
sc_ = scores.reshape(batch_size, img_per_sample)
baseline = (sc_.sum(1, keepdim=True) - sc_) / (sc_.shape[1] - 1)
# sample - baseline
reward = scores.reshape(batch_size, img_per_sample)
reward = reward - baseline
reward = reward.view(-1)
return reward, scores
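# Re-score the sampled code sequences under the current model: encoder inputs are repeated once per generated image and the generated tokens are fed back as prev_output_tokens.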
def get_net_output(self, model, sample, gen_target):
def merge(sample_list, eos=self.task.tgt_dict.eos(), move_eos_to_beginning=False):
return data_utils.collate_tokens(
sample_list,
pad_idx=self.padding_idx,
eos_idx=eos,
left_pad=False,
move_eos_to_beginning=move_eos_to_beginning,
)
batch_size = len(sample["target"])
gen_target_size = len(gen_target)
img_per_sample = gen_target_size // batch_size
model.train()
sample_src_tokens = torch.repeat_interleave(
sample['net_input']['src_tokens'], img_per_sample, dim=0
)
sample_src_lengths = torch.repeat_interleave(
sample['net_input']['src_lengths'], img_per_sample, dim=0
)
sample_code_masks = torch.repeat_interleave(
sample['net_input']['code_masks'], img_per_sample, dim=0
)
gen_prev_output_tokens = torch.as_tensor(
merge(gen_target, eos=self.task.tgt_dict.bos(), move_eos_to_beginning=True),
device=sample["target"].device, dtype=torch.int64
)
gen_target_tokens = torch.as_tensor(
merge(gen_target), device=sample["target"].device, dtype=torch.int64
)
net_output = model(
src_tokens=sample_src_tokens, src_lengths=sample_src_lengths,
code_masks=sample_code_masks, prev_output_tokens=gen_prev_output_tokens
)
return net_output, gen_target_tokens
def get_lprobs_and_target(self, model, net_output, gen_target):
if self.constraint_start is not None and self.constraint_end is not None:
net_output[0][:, :, 4:self.constraint_start] = -math.inf
net_output[0][:, :, self.constraint_end:] = -math.inf
lprobs = model.get_normalized_probs(net_output, log_probs=True)
if self.ignore_prefix_size > 0:
if getattr(lprobs, "batch_first", False):
lprobs = lprobs[:, self.ignore_prefix_size :, :].contiguous()
gen_target = gen_target[:, self.ignore_prefix_size :].contiguous()
else:
lprobs = lprobs[self.ignore_prefix_size :, :, :].contiguous()
gen_target = gen_target[self.ignore_prefix_size :, :].contiguous()
return lprobs, gen_target
def compute_loss(self, model, sample, reduce=True):
gen_target, gen_res, gt_text = self.get_generator_out(model, sample)
reward, scores = self.get_reward_and_scores(gen_res, gt_text, device=sample["target"].device)
net_output, gen_target_tokens = self.get_net_output(model, sample, gen_target)
gen_lprobs, gen_target_tokens = self.get_lprobs_and_target(model, net_output, gen_target_tokens)
loss, ntokens = scst_loss(gen_lprobs, gen_target_tokens, reward, ignore_index=self.padding_idx, reduce=reduce)
nsentences = gen_target_tokens.size(0)
return loss, scores.sum(), ntokens, nsentences
@classmethod
def reduce_metrics(cls, logging_outputs) -> None:
"""Aggregate logging outputs from data parallel training."""
loss_sum = sum(log.get("loss", 0) for log in logging_outputs)
score_sum = sum(log.get("score", 0) for log in logging_outputs)
ntokens = sum(log.get("ntokens", 0) for log in logging_outputs)
nsentences = sum(log.get("nsentences", 0) for log in logging_outputs)
sample_size = sum(log.get("sample_size", 0) for log in logging_outputs)
metrics.log_scalar(
"loss", loss_sum / sample_size, sample_size, round=3
)
metrics.log_scalar(
"score", score_sum / nsentences, nsentences, round=3
)
metrics.log_scalar(
"ntokens", ntokens, 1, round=3
)
metrics.log_scalar(
"nsentences", nsentences, 1, round=3
)
metrics.log_scalar(
"sample_size", sample_size, 1, round=3
)
@staticmethod
def logging_outputs_can_be_summed() -> bool:
"""
Whether the logging outputs returned by `forward` can be summed
across workers prior to calling `reduce_metrics`. Setting this
to True will improves distributed training speed.
"""
return True